haptork / easyLambda
data processing with modern C++ and MPI. Modular, parallel, based on data-flow with map and reduce.
The project started with the need for a standard way to process data with C++. The design goals are composability, an easy interface, decoupling of I/O, data format and parallel code from the algorithm logic, less boilerplate code, and accessibility to anyone who knows C. easyLambda achieves these goals with a type-safe data-flow pipeline, map/reduce-like operations and MPI parallelism, presented through an easy yet powerful ExpressionBuilder interface made possible by modern C++ features.
Use ezl for your data-processing tasks: to write post-processors for simulation results, for iterative machine-learning algorithms, for general list data processing, or for any data- or task-parallel code. You get an expressive way to compose algorithms as a pipeline of small tasks, a clean separation of algorithm logic from I/O and parallelism, various built-in functions for common operations, and MPI parallelism with little to no extra code. It makes your application clean, modular, parallel, expressive and all the good things in programming practices :). Check the examples listed below to see some of the ways ezl has been used.
You can use it along with other libraries like OpenCV, Dlib or Thrust that work well with standard data types like vector and tuple.
If you are a C++ enthusiast then you will likely find the project quite interesting. Contributions and feedback of any kind are much appreciated. Please check contributing.md for more.
Here is a short example to begin with. The program calculates the frequency of each word in the data files. Words are considered the same irrespective of their case (upper or lower).
#include <string>

#include <boost/mpi.hpp>

#include "ezl/ezl.hpp"
#include "ezl/algorithms/readFile.hpp"
#include "ezl/algorithms/reduces.hpp"

int main(int argc, char* argv[]) {
  using std::string;
  using ezl::readFile;
  boost::mpi::environment env(argc, argv);

  // rise: read the file(s); with rowSeparator('s') and an empty column
  // separator every white-space separated word becomes a row with one
  // string column.
  // reduce<1>: group rows by the word (column 1) and count them, then dump.
  ezl::rise(readFile<string>(argv[1]).rowSeparator('s').colSeparator(""))
    .reduce<1>(ezl::count(), 0).dump()
    .run();
  return 0;
}
The data-flow starts with rise and subsequent operations are added to the pipeline. In the above example, the pipeline starts with reading data from file(s). readFile is a library function that takes the column types and a file glob pattern as input and reads the file(s) in parallel. It has many properties for controlling data format, parallelism, denormalization etc. (shown in demoReadFile).
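As a rough sketch of the column handling (the file name, the two-column layout and the comma separator here are assumptions for illustration, not part of the example above), reading two columns from comma-separated files could look like:

// Hypothetical: each row has a string and a numeric column, separated by commas.
ezl::rise(ezl::readFile<std::string, double>("data*.csv").colSeparator(","))
  .dump()
  .run();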
In reduce we pass the index of the key column, the library function for counting, and the initial value of the result. The wordcount example is too simple to show the library features.
Following is the data-flow for calculating pi using the Monte Carlo method.
ezl::rise(ezl::kick(10000)) // 10000 trials in total
  .map([] {
    // pick a random point in the unit square
    auto x = rand01();
    auto y = rand01();
    return x*x + y*y;
  })
  .filter(ezl::lt(1.))            // keep only points inside the unit circle
  .reduce(ezl::count(), 0)        // count them
  .map([](int inCircleCount) {    // pi ~ 4 * (points in circle) / (total points)
    return (4.0 * inCircleCount / 10000);
  }).colsTransform().dump()
  .run();
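rand01() is not part of the library; it stands for a user-supplied helper that returns a uniform random number in [0, 1). A minimal sketch of such a helper:

#include <random>

// Hypothetical helper assumed by the snippet above: uniform double in [0, 1).
double rand01() {
  static thread_local std::mt19937 gen{std::random_device{}()};
  static thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(gen);
}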
The steps in the algorithm have been expressed as a composition of small operations; some are common library functions like count() and lt() (less-than), and some are user-defined functions specific to the problem.
Not only are the above examples expressive and modular, they are also highly efficient in serial as well as parallel execution, with close to linear speed-up on multiple cores or multiple nodes. The implementation aims at reducing the number of copies of the data, resulting in little to no overhead over serial code written specifically to carry out the same operation.
You can find the above examples and many more, in detail and with benchmarking results, in the examples directory.
The examples directory also has separate demonstrations for features and options, along with explanations, to get started with ezl quickly.
The following figure shows an overview of the parallel options for units in a pipeline. The numbers inside the circles are the process ranks a unit is running on; e.g. the first unit can be a readFile running on process ranks {0, 1}, {2, 3, 4} can be running a map or reduce, and so on. A reduce task is by default parallel, while map tasks by default run in-process. The prll option on a unit controls this behaviour. Processes can be requested as a number, as a ratio of the parent unit's processes, or as exact process ranks. If the requested processes are not available, the program still runs correctly with the best possible allocation to the units. demoPrll has detailed examples and options on this. Many other demos and examples use the prll option with different units and options.
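As a rough sketch of the prll property (the ratio argument and the pipeline itself are assumptions for illustration; demoPrll shows the actual options), a map could request half of its parent's processes like this:

// Hypothetical pipeline: request processes for the map by ratio of its parent.
ezl::rise(ezl::readFile<std::string>("data*.txt").rowSeparator('s').colSeparator(""))
  .map([](const std::string& word) { return word.size(); })
    .prll(0.5)                    // assumed: half of the parent unit's processes
  .reduce<2>(ezl::count(), 0)     // reduce is parallel by default
  .dump()
  .run();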
Following are some benchmark results on different problems.
The number of trials for pi is doubled as the number of processes is doubled, keeping the trials per process constant (weak scaling); in this case a flat line implies ideal parallelism. The logistic regression and wordcount benchmarks show a decrease in execution time until it drops to around a minute. For more information on the benchmarks, check the respective examples.
There are no restrictions on data-flow connections except the types of the columns. The following figures demonstrate a circular data-flow and a diamond-like data-flow pipeline:
Each of these tasks can be running on multiple processes, depending on availability and options.
A data-flow can also run inside a user function of another data-flow. Data-flows can be joined, branched, and built to run later, multiple times, on different data.
demoFlow shows the code and details for the options discussed and for the above two data-flow figures. Many other examples also use flow properties.
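As a rough sketch of building a flow once and running it later (this assumes the build()/flow() style demonstrated in demoFlow; the column types and file pattern are made up for illustration):

// Hypothetical: build the data-flow now, run it later (possibly several times).
auto wordCounts = ezl::rise(ezl::readFile<std::string>("part*.txt")
                              .rowSeparator('s').colSeparator(""))
                    .reduce<1>(ezl::count(), 0).dump()
                    .build();    // assumed: returns the flow instead of running it

ezl::flow(wordCounts).run();     // assumed: attach to the built flow and run it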
This is a header-only library, so all that is needed to start using it is to place the contents of the include directory within your source tree so that it is available to the compiler. Include include/ezl.hpp in your program. If you use algorithms like ezl::count etc., then also include the required files from the include/ezl/algorithms/ directory.
The ezl library itself has no linking requirements, but it uses boost::serialization and boost::mpi, which need to be linked.
Here is how you can compile a program in general:
mpic++ file.cpp -Wall -std=c++14 -O3 -I path_to_ezl_include -lboost_mpi -lboost_serialization
If you have added the contents of the include directory to your source tree or to your global includes, then you don't need to pass the -I path_to_ezl_include flag.
You can compile the unit tests with make unittest and run them with ./bin/unitTest.
You can compile an example using make with make example fname=name, where name is the example's file name without extension, e.g. wordcount.
After compiling, the executable can be run with mpirun:
mpirun -n 4 path_to_exe args…
or simply as:
path_to_exe args…
Check examples to begin writing with ezl.