What is it all about?

C++ is one of the most widely used programming languages in the world, and it is used specifically where programs must run fast. Most computers today have a single processor, but this is changing due to increasing performance needs. For technical reasons, it is getting more and more difficult to build processors with faster clock frequencies. Anyone can check this by looking at new computers: their frequencies do not go beyond about four gigahertz. A good explanation of this situation is given in this well-known article: http://www.gotw.ca/publications/concurrency-ddj.htm The problem is that the need for computing performance is still increasing, although the simple solution that existed before - buying a faster machine - will no longer be possible in a few years. The only solution is to add parallelism to existing and new programs.

Why this name, RPA (Range Partition Adapters)? Do you owe something to http://www.stepanovpapers.com/p5-austern.pdf ?

Yes. I started working alone on this project and, by chance, came across this very interesting article, which described much more than what I was doing. So I tried to implement what is described there. In an early stage of its existence, this library was temporarily named ATL, for "Another Threaded Library".

In one sentence, what is this library made for?

If you want to use STL algorithms and multithreading, and keep the same architecture and design, then this library is made for you.

Why insist on the use of existing sequential C++ components, such as insertion iterators, lists and C-style arrays, when parallelized and distributed versions of these components are available?

This library insists on the use of standard C++ containers, for several reasons:

What is the 'plus' of this library?

Many attempts have been made; most of them failed or were not satisfying. This confirms that there is a need, but also that there are technical difficulties. Some examples are STAPL, pSTL, HPC++, MPI and OpenMPI. Some of them have one or several of these drawbacks:

Do I need external libraries?

No. Only include files and a couple of very small .cpp files that you can inline or link as shared libraries or object files, as you wish. Everything is based on templates.

What are the requirements on my code?

The functors used in the algorithms must be thread-safe: basically, everything you pass to them must be.

There are bugs. Why release if there are still bugs?

There are a couple of bugs which are under control: they occur only on Cygwin, and are probably due to optimization problems. On the other hand, many features seem to work perfectly on different machines, architectures, etc. See the file BUGS.

What are the requirements for my STL library?

RPA assumes your implementation of the STL is thread-safe. For thread-safety, please refer to the definition given on the SGI web site. In short: several simultaneous readers are safe as long as there are no writers; a writer needs exclusive access.

I cannot know in advance how many processors are available, because my machine may be loaded.

Most of the time, except during peaks, CPUs are idle; just take an average over one hour. If your machine really is that loaded, buy another one: computers are much cheaper than the time spent developing software.

What are pseudo-threads such as rpa::thread_crea and rpa::thread_join?

The library provides thread types which conform to the required thread interface, but behave purely sequentially. Their role is to help debugging: to use them, you just replace one thread type with another.

I want to use my own threads and my own mutex.

This is very easy: a thread is any object which has the methods create and join; a mutex is any object which has the methods lock, unlock and try_lock. In fact, try_lock is not even strictly required at the moment.

What should I do if my program crashes?

First check whether your program works in sequential mode: use only pseudo-threads instead of real threads, then recompile and relink. Your program must run smoothly, because the thread types must be totally transparent for your application. Then check whether your code is thread-safe. Do not hesitate to add a couple of mutexes if you are not sure; simply note in a comment that they may be superfluous. If your program works, test it more and more thoroughly, based on the principle that all mutexes and threads are interchangeable. So, generously try many combinations of thread types, mutex types, numbers of threads, available processors, buffer types, etc.: your program should always work. But if you have checked your code's thread safety and your program still fails, then this is probably my fault; sorry in advance.

What about the order of results? The order of results is changed!

Yes. Most of the time, the order of results is not guaranteed and differs from the sequential execution. This may be annoying, but: very often, the order of results is not relevant; restoring the original order would be possible, at a price in speed and complexity that many users may not wish to pay; and it is very often possible to restore this order yourself (sorting, indices, etc.). There are also specific scheduling methods which guarantee that the output order is kept.

Why is rpa::thread_tree recursive?

To handle parallel functions started from another parallel function. Create a rpa::thread_tree at the top of your program and pass it to sub-programs. This way, if you are 'deep' in a call chain where one parallel function calls another, the deepest call will simply use the sequential algorithm. On the other hand, if the function is called 'at the top' of the program, it will use the top of the rpa::thread_tree. There is another long-term plan: imagine that you have a cluster of NUMA computers. The top-level threads will be processes running on different machines, the second level will be processes running on different CPU boards, and the third level will be different cores of the same processor.

Why are only two algorithms, std::accumulate and std::transform, implemented?

Only two algorithms are implemented yet, because I concentrate on these two to implement many concepts used in all algorithms: reduction, usage of buffers, scheduling policy, etc. The same code will be used in other algorithms, but it is easier to first make it work for one algorithm only. In the meantime, note that you can implement std::for_each and std::copy with std::transform, and std::count with std::accumulate. What is sure is that the next algorithm to be implemented will be one of the std::remove algorithms, because many things were designed especially for it.

Why are the tests so slow to compile?

Some test programs take ages to compile because many different combinations of container and buffer types and sizes are used, to get the best possible code coverage. Many combinations are possible.

Why do these tests run so slowly?

Because to detect race conditions, you must run the code for a long time. Moreover, the library is designed to avoid creating useless threads when there is not much data, so to make real tests, you need a lot of data. This is the price for detecting race conditions, although some sophisticated tools could probably do the job.

Why not use special types for detecting race conditions, like Andrei Alexandrescu's tools?

Because it would make compilation more complicated, and it is already very complicated. We try to have the smallest possible code, because it has to compete with serialized code when only one processor is available, which unfortunately happens too often.

Fine, I can buffer std::back_inserter with one buffer per thread. But what about std::inserter and std::front_inserter?

You are right, this remains to be done. The idea was to first develop the most commonly used of the three inserters, to be sure that the concept is right and that we can build on it.

Is there any help needed?

Yes, for writing test programs. There are already about twenty test programs, but not all configurations, parameter combinations, compilers and architectures could be tested. Any new test is welcome.

Why so many templates, and so few virtual methods?

Multi-threading involves many, many bottlenecks: mutexes, context switching, complicated data structures (to bring type-safety), etc. So a multi-threaded program is sometimes slower than a sequential one. Of course, no one admits it, especially if the brand-new 16-processor machine cost $1,000,000. Moreover, sometimes the speed-up can never be very big, even with a 100% idle machine, perfect mutexes, zero-cost context switches, etc. Example: try to compute the length of a list with two threads, one going forward from the beginning and one going backward from the end; the speedup will always be less than two. The lesson is that the overhead brought by the library must be as small as possible: there will always be situations where no thread is available, or there are too few elements to process, etc., and where all the multi-threading machinery is useless and makes things slower than the existing sequential algorithm. So the multi-threading layer must be as thin as possible. This implies:

It does not compile with GCC 2.95

Yes, this is a deliberate choice.

It does not compile with Visual C++

It should. This platform is not supported yet, but everything is written in plain C++. Please be patient.

What is a pipe_circular?

It is a circular pipeline built on any type of STL container. It needs a condition variable: any type which matches a very simple interface. It is compliant with the STL style, and makes it possible to pipe the output of an RPA parallelized algorithm (or a plain STL algorithm) into another RPA parallelized algorithm (or a plain STL algorithm). This enables pipeline computing.

What is a pipe_archiver?

The same as a pipe_circular, except that it has an infinite size, so you do not need an independent thread when reading from it; in other words, it never blocks. It stores everything, so the content can be reused. It can have several readers and several writers. The elements can be read or written one at a time, or slice by slice (this is also true for pipe_circular). Another difference is that pipe_circular never grows, and can therefore use a C-style array as its underlying container.

Other types of pipelines?

Yes. We plan to implement pipe_infinite, which will behave like a pipe_archiver but will reuse space as elements are read. The consequence is that it will not be possible to read elements already read once; on the other hand, it will need less space. Another project is pipe_truncate, which also has a limited size but never blocks: it simply keeps the N latest elements, in the order they were inserted, or the N 'biggest' elements according to a comparison function. In other words, it is an asynchronous priority queue. One may say that pipelines are not exactly in the scope of this project. Yes and no: they intensively reuse existing primitives, and make it possible to add pipeline processing between parallel algorithms.

What is future made for?

It stores a task and the thread which is computing this task. The result is what is computed by the thread. It is very close to the usual concept of a future.

Why insist on I/O, input iterators, streams, etc., which are very difficult to parallelize?

Programs are made of CPU-intensive parts and IO-intensive parts. Unfortunately, parallelization libraries usually focus on the CPU-intensive parts, because the benefit there is bigger and easier to get. This often misses the point, because performance bottlenecks are often, too often, in program parts where CPU and IO are mixed together: real-life programs love containers such as lists and sets more than arrays, because they are more flexible. This is why we aim at providing some sort of parallelization even in the most extreme cases. For example:

Of course, the results are not always fantastic - far from it - but "real" programs are difficult to parallelize, and form the vast majority of existing (and too slow) programs. So this library tries to provide some sort of speed-up in all situations, to relieve programmers of the burden of thinking about the behaviour of their program and about its performance at the same time. This is the same approach as OpenMP, which allows adding parallelism directives to a program which already runs and is tested in sequential mode.

The documentation is incomplete and/or outdated.

Yes, sorry for that. There are plans for a very complete documentation, but it takes time. A good source of documentation is doxygen, because the source comments are written to be picked up by it.

The term 'pipeline' is misleading.

Yes. Our pipe_circular and pipe_archiver are close to tbb::concurrent_queue, and have nothing to do with tbb::pipeline. The fact is that they try to mimic Unix pipes: data in and data out, following the insertion order. Hence the name.

Any long-term objectives?

Yes, many:
