C++ is one of the most widely used programming languages in the world. It is chosen, in particular, for programs that must run fast.
Most computers today are mono-processor, but this is changing due to increasing performance needs.
For technical reasons, it is getting more and more difficult to build faster processors with higher clock frequencies. Anyone can check this by looking at new computers: their frequencies do not go beyond four gigahertz. A good explanation of this situation is given in this well-known article: http://www.gotw.ca/publications/concurrency-ddj.htm
The problem is that the need for computing performance keeps increasing, while the simple solution that used to exist - buying a faster machine - will no longer be available within a few years.
The only solution is to add parallelism to existing and new programs.
Yes. I started to work alone on this project and, by chance, I came across this very interesting article, which described much more than what I was doing. So I tried to implement what is described there. In an early stage of its existence, this library was temporarily named ATL, for "Another Threaded Library".
If you want to use STL algorithms and multithreading, and keep the same architecture and design, then this library is made for you.
This library insists on the use of standard C++ containers, for several reasons:
Many attempts have been made; most of them failed or were not satisfying. This confirms that there is a need, but also that there are technical difficulties. Some examples are STAPL, pSTL, HPC++, MPI, OpenMPI. Some of them have one or several of these drawbacks:
No. Only include files and a couple of very small .cpp files that you can inline or link as shared libraries or object files, as you wish. Everything is based on templates.
The functors used in the algorithms must be thread-safe; basically, so must everything you pass to them.
There are a couple of bugs which are under control: they occur only on Cygwin, and are probably due to optimization problems. On the other hand, many features seem to work perfectly on different machines, architectures, etc. See the file BUGS.
RPA assumes your STL implementation is thread-safe. For thread-safety, please refer to the definition given on the SGI web site. In short: several simultaneous readers are allowed as long as there is no writer; a writer needs exclusive access.
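As an illustration of that rule in plain standard C++ (using C++17's std::shared_mutex; this is not an RPA class), a container protected by a readers/writer lock could look like this:

    #include <cstddef>
    #include <mutex>
    #include <shared_mutex>
    #include <vector>

    // Illustration of the 'several readers / one exclusive writer' rule.
    class shared_vector {
        std::vector<int> data_;
        mutable std::shared_mutex mtx_;
    public:
        int read(std::size_t i) const {
            std::shared_lock<std::shared_mutex> lk(mtx_); // many readers at once
            return data_.at(i);
        }
        void write(int value) {
            std::unique_lock<std::shared_mutex> lk(mtx_); // writer is exclusive
            data_.push_back(value);
        }
    };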
Most of the time, except during peaks, CPUs are idle: just average the load over one hour. If your machine is really that loaded, buy another one. Computers are much cheaper than the time spent developing software.
What are rpa::thread_crea and rpa::thread_join?
The library provides thread types which conform to the required thread interface, but behave purely sequentially. Their role is to help debugging. To use them, you just need to replace one thread type by another.
This is very easy: a thread is any object which has the methods create and join. A mutex is any object which has the methods lock, unlock and try_lock. In fact, try_lock is not even really required at the moment.
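For instance, here is a minimal sketch of a sequential 'pseudo thread' and a do-nothing mutex satisfying that interface. The names pseudo_thread and null_mutex are illustrative only, not RPA's actual classes:

    // Hypothetical sketch: a sequential "pseudo thread" and a do-nothing
    // mutex that satisfy the duck-typed interface described above.
    template<class Functor>
    class pseudo_thread {
        Functor func_;
    public:
        explicit pseudo_thread(Functor f) : func_(f) {}
        void create() { func_(); }  // runs the task immediately, in the caller
        void join()   {}            // nothing to wait for
    };

    class null_mutex {
    public:
        void lock()     {}
        void unlock()   {}
        bool try_lock() { return true; }
    };

Since thread and mutex types are interchangeable template parameters, substituting such types makes the whole program run sequentially, which is exactly the debugging trick described above.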
You must first check whether your program works in sequential mode. For this, use only pseudo-threads instead of real threads, then recompile and relink. Your program must run smoothly: the thread types must be totally transparent for your application. Then, check whether your code is thread-safe. Do not hesitate to add a couple of mutexes if you are not sure; simply note in a comment that they may be superfluous. If your program works, then you can test it more and more thoroughly, based on the principle that all mutexes and threads are interchangeable. So, generously try many combinations of thread types, mutex types, numbers of threads, available processors, buffer types, etc.: your program should always work. But if you have checked your code's thread safety and your program still fails, then this is probably my fault; sorry in advance.
Yes. Most of the time, the order of results is not guaranteed and differs from the sequential execution. This may be annoying, but: very often, the order of results is not relevant. It would have been possible to restore the original result order, at a price in speed and complexity that many users may not wish to pay. It is very often possible to restore this order yourself: sorting, indices, etc. There are also specific scheduling methods which guarantee that the output order will be kept.
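For example, a common do-it-yourself way to restore the sequential order, in plain standard C++ and independent of RPA, is to tag each result with the index of its input element and sort the output by index afterwards:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // Pretend these (index, value) pairs were produced by threads
        // in an arbitrary order.
        std::vector<std::pair<std::size_t, double>> results = {
            {2, 30.0}, {0, 10.0}, {1, 20.0}
        };
        // One sort by index restores the sequential order.
        std::sort(results.begin(), results.end(),
                  [](const auto& a, const auto& b) { return a.first < b.first; });
        for (const auto& r : results)
            std::cout << r.second << '\n'; // 10 20 30
        return 0;
    }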
Why is rpa::thread_tree recursive?
To handle the case of a parallel function started from another parallel function. Create a rpa::thread_tree at the top of your program and pass it to sub-programs. This way, if you are 'deep' in a call of a parallel function calling another one, the deepest one will simply use the sequential algorithm. On the other hand, if this function is called 'at the top' of the program, it will use the top of the rpa::thread_tree. There is another long-term plan: imagine that you have a cluster of NUMA computers. The top-level threads would be processes running on different machines, the second level would be processes running on different CPU boards, and the third level would be different cores of the same processor.
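As a rough sketch of this usage pattern (the type below is a stand-in; the real rpa::thread_tree interface may well differ):

    #include <vector>

    // Hypothetical stand-in for rpa::thread_tree: one entry per level,
    // giving the number of threads available at that depth.
    struct thread_tree_sketch {
        std::vector<int> levels;
        bool has_level() const { return !levels.empty(); }
        thread_tree_sketch sub_tree() const {   // what a callee would receive
            if (levels.size() <= 1) return thread_tree_sketch{};
            return thread_tree_sketch{
                std::vector<int>(levels.begin() + 1, levels.end()) };
        }
    };

    void inner_algorithm(const thread_tree_sketch& tree) {
        if (!tree.has_level()) {
            // No threads left at this depth: run the plain sequential code.
        } else {
            // Split the work across tree.levels.front() threads and pass
            // tree.sub_tree() to any nested parallel call.
        }
    }

    void outer_algorithm(const thread_tree_sketch& tree) {
        // Uses the top level itself, hands the remaining levels down.
        inner_algorithm(tree.sub_tree());
    }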
Why only std::accumulate and std::transform?
Only two algorithms are implemented so far, because I concentrate on these two to implement many concepts used in all algorithms: reduction, usage of buffers, scheduling policy, etc. The same code will be used in the other algorithms, but it is easier to make it work for one algorithm first. In the meantime, note that you can implement std::for_each and std::copy with std::transform, and std::count with std::accumulate, as shown below. What is sure is that the next algorithm to be implemented will be one of the std::remove algorithms, because many things were designed especially for it.
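For reference, here are those equivalences written in plain standard C++, without RPA:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v = {1, 2, 2, 3};
        std::vector<int> out;

        // std::copy expressed with std::transform: an identity functor copies.
        std::transform(v.begin(), v.end(), std::back_inserter(out),
                       [](int x) { return x; });

        // std::for_each expressed with std::transform: apply a side-effecting
        // functor and ignore the transformed values.
        std::transform(v.begin(), v.end(), out.begin(),
                       [](int x) { std::cout << x << ' '; return x; });

        // std::count expressed with std::accumulate: add 1 for each match.
        long twos = std::accumulate(v.begin(), v.end(), 0L,
                                    [](long n, int x) { return n + (x == 2); });
        std::cout << "\ncount of 2: " << twos << '\n'; // 2
        return 0;
    }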
Some test programs take ages to compile because many different combinations of container and buffer types and sizes are used, in order to have the best possible code coverage. Many combinations are possible.
Because to detect race conditions, you must run the code for a long time. Moreover, the library is designed to avoid creating useless threads when there is not a lot of data. So, to make real tests, you need a lot of data. This is the price for detecting race conditions, although some sophisticated tools could probably do the job.
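To see why long runs are needed, here is a classic data race in plain standard C++ (unrelated to RPA): increments are sometimes lost, but a short run may well finish with the correct total by sheer luck:

    #include <iostream>
    #include <thread>

    int main() {
        long counter = 0;                       // unprotected shared data
        auto work = [&counter] {
            for (int i = 0; i < 1000000; ++i)
                ++counter;                      // racy read-modify-write
        };
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        // Usually prints less than 2000000: some increments were lost,
        // but how many, and whether any, varies from run to run.
        std::cout << counter << '\n';
        return 0;
    }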
Because it would make compilation more complicated, and it is already very complicated. We try to have the smallest possible code, because it has to compete with serialized code when only one processor is available, which unfortunately happens too often.
You implement std::back_inserter with one buffer per thread. But what about std::inserter and std::front_inserter?
You are right, this still has to be done. The idea was to first develop the most commonly used of the three inserters, to be sure that the concept is right and that we can build on it.
Yes, for writing test programs. There are already about twenty test programs, but not all configurations, parameter combinations, compilers and architectures could be tested. Any new test is welcome.
Multi-threading involves many, many bottlenecks: mutexes, context switching, complicated data structures (to bring type-safety), etc. So, sometimes, a multi-threaded program is slower than a sequential one. Of course, no one says so - especially if the brand new 16-processor machine cost $1,000,000. More: sometimes the speed-up can never be very big, even with a 100% idle machine, perfect mutexes, zero-cost context switches, etc. Example: just try to compute the length of a list with two threads, one thread going forward and another thread going backward from the end: the speedup will always be less than two. The lesson is that the overhead brought by the library must be as small as possible: there will always be situations where no thread is available, or there are too few elements to process, etc., and where all the multi-threading machinery is useless and makes things slower than the existing sequential algorithm. So, the multi-threading layer must be as thin as possible. This implies, among other things, avoiding dynamic memory allocation (::new and malloc).
Yes. It is a deliberate choice.
It should. This platform is not supported yet, but everything is written in plain C++. Please be patient.
What is pipe_circular?
It is a circular pipeline built on any type of STL container. It needs a condition variable: any type which matches a very simple interface will do. It is compliant with the STL style, and it allows piping the output of an RPA parallelized algorithm (or a plain STL algorithm) into another RPA parallelized algorithm (or a plain STL algorithm). In other words, it enables pipeline computing.
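Its behaviour is essentially that of a bounded producer/consumer queue. Here is a minimal sketch of that concept in plain standard C++; this is not pipe_circular's actual interface:

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    // Concept sketch only: a bounded FIFO that blocks writers when full
    // and readers when empty, which is what a circular pipe provides.
    template<class T>
    class bounded_pipe {
        std::deque<T> buf_;
        std::size_t capacity_;
        std::mutex mtx_;
        std::condition_variable not_full_, not_empty_;
    public:
        explicit bounded_pipe(std::size_t cap) : capacity_(cap) {}
        void push(const T& x) {
            std::unique_lock<std::mutex> lk(mtx_);
            not_full_.wait(lk, [this] { return buf_.size() < capacity_; });
            buf_.push_back(x);
            not_empty_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(mtx_);
            not_empty_.wait(lk, [this] { return !buf_.empty(); });
            T x = buf_.front();
            buf_.pop_front();
            not_full_.notify_one();
            return x;
        }
    };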
What is pipe_archiver?
It is the same as a pipe_circular, except that it has an infinite size, so you do not need an independent thread when reading from it. In other words, it never blocks. It stores everything, so the content can be reused. It can have several readers and several writers. The elements can be read or written one at a time, or slice by slice (this is also true for pipe_circular). Another difference is that pipe_circular never grows, and can use a C-style array as its underlying container.
Yes. We plan to implement pipe_infinite, which will behave like a pipe_archiver but will reuse space as elements are read. The consequence is that it will not be possible to read elements that have already been read once; on the other hand, it will need less space. Another project is pipe_truncate, which also has a limited size, but never blocks: it simply keeps the N latest elements, in the order they were inserted, or the N 'biggest' elements according to a comparison function. In other words, it is an asynchronous priority queue.
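The 'N biggest elements' behaviour can be sketched with a size-capped min-heap in plain standard C++ (just the concept, not the planned interface):

    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <vector>

    // Concept sketch: keep only the N biggest elements seen so far.
    // A min-heap of size N makes each insertion O(log N).
    class biggest_n {
        std::size_t n_;
        std::priority_queue<int, std::vector<int>, std::greater<int>> heap_;
    public:
        explicit biggest_n(std::size_t n) : n_(n) {}
        void insert(int x) {
            if (heap_.size() < n_) {
                heap_.push(x);
            } else if (x > heap_.top()) {
                heap_.pop();      // drop the smallest of the kept elements
                heap_.push(x);
            }
        }
    };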
One may say that pipelines are not exactly in the scope of this project. Yes and no: they intensively reuse the existing primitives, and they allow adding pipeline processing between parallel algorithms.
What is rpa::future made for?
It stores a task and the thread which is computing this task. The result is what is calculated by the thread. It is very close to the usual concept of a future.
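For comparison, here is the usual concept of a future as found in modern standard C++ (std::future, not RPA's class):

    #include <future>
    #include <iostream>

    int main() {
        // std::async starts a task, possibly on another thread, and returns
        // a future that will hold the task's result.
        std::future<int> f = std::async([] { return 6 * 7; });
        // get() blocks until the computing thread has produced the value.
        std::cout << f.get() << '\n'; // 42
        return 0;
    }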
Programs are made of CPU-intensive parts and IO-intensive parts. Parallelization libraries usually focus on the CPU-intensive parts, because the benefit there is bigger and easier to obtain. Unfortunately, this often misses the point, because performance bottlenecks are often, too often, in program parts where CPU and IO are mixed together: real-life programs love containers such as lists and sets more than arrays, because they are more flexible. This is why we aim at providing some sort of parallelization even in the more extreme cases. For example:
Yes, sorry for that. There are plans to write very complete documentation, but it takes time. A good source of documentation is the doxygen output, because the comments are intended to be picked up by doxygen.
Yes. Our pipe_circular and pipe_archiver are close to tbb::concurrent_queue, and have nothing to do with tbb::pipeline. The fact is that they try to mimic Unix pipes: data in and data out, following the insertion order. Hence the name.
Yes, many:
Processes could be created with ::fork() and joined with waitpid(), instead of threads. This might be possible when a functor communicates with the main application only through the input and output iterators. This could guarantee increased reliability. The difficulty is to find a clean way to return the result (shared memory? a Unix pipe?). This will imply a specialization of rpa::future.
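A minimal POSIX sketch of this idea, using a Unix pipe to return the result (how RPA would wrap this is exactly the open question mentioned above):

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fds[2];
        if (pipe(fds) != 0) return 1;       // result channel: child -> parent
        pid_t pid = fork();
        if (pid == 0) {                     // child: compute, write, exit
            close(fds[0]);
            int result = 6 * 7;             // stands for the functor's work
            write(fds[1], &result, sizeof result);
            close(fds[1]);
            _exit(0);
        }
        close(fds[1]);                      // parent: read the result...
        int result = 0;
        read(fds[0], &result, sizeof result);
        close(fds[0]);
        int status = 0;
        waitpid(pid, &status, 0);           // ...then 'join' the child
        std::printf("%d\n", result);        // 42
        return 0;
    }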
tbb::blocked_range could be a new type of range; rpa::pipe_circular and rpa::pipe_archiver could have an interface close to tbb::concurrent_queue; tbb::pipeline could be used as a thread pool, hidden in a rpa::thread_tree; the algorithms tbb::parallel_for, tbb::parallel_while, etc. could be rewritten using RPA's approach; and RPA could provide template specializations for TBB concurrent containers. Many orthogonal combinations would be possible, which is good for a problem as complicated as parallelism.
Unlike Unix pipes, which are only character-oriented, these pipes can transport typed elements, for example through a rpa::row_buffer.