GPUWorker master/slave multi-gpu approach

I’m (finally) gearing up to write some muli-gpu code. As you might know, the CUDA model is one thread = one GPU context, which is suited to a peer model of threading. That is fine in many cases, but my application structure is very modular and pretty much incompatible with a peer threading model. Sure, I could change things, but doing so will introduce a lot of complications I would prefer to avoid. Thus I needed some way to use a master/slave thread approach, where a worker thread holds a CUDA context and the master thread can send messages to many slave threads. GPUWorker was born. Since this may be useful to someone else (and the code is open source), I thought I’d share it with everyone.


  • A single master thread can call CUDA runtime and kernel functions on multiple GPUs

  • ANY CUDA runtime function (actually, any function returning cudaError_t) can be called in the worker thread easily with a simple syntax

  • No performance difference from straight CUDA calls (in realistic situations, see performance tests below)

  • Works in windows and linux


  • A slight extra latency is added to synchronous calls (due to OS thread scheduling)


GPUWorker gpu0(0);

GPUWorker gpu1(1);

// allocate data

int *d_data0;, (void**)((void*)&d_data0), sizeof(int)*N));

int *d_data1;, (void**)((void*)&d_data1), sizeof(int)*N));


// call kernel

gpu0.callAsync(bind(kernel_caller, d_data0, N));

gpu1.callAsync(bind(kernel_caller, d_data1, N));

Get the code…orker.h?rev=994…

Using the code is easy: just compile into your project. Note that you probably want to remove the #ifdef USE_CUDA macro guard, this is used in HOOMD for CPU only builds. You also need to have boost ( installed and link against the boost thread library.

The code is part of HOOMD which is released under an open source license: see the file for the details. The code also contains extensive documentation in doxygen style code comments.

Performance tests

All the mutex locks, context switches, etc… do add up to a small bit of extra overhead for each call. This is most apparent when making synchronous calls. The simplest test I can think of to measure the overhead is to repeatedly copy 4 bytes from the device to the host. Here are the results (tested in 64-bit linux on a single GPU of the Tesla D870):

GPUWorker latency test

Time per call 34.431 us

Standard latency test

Time per call 24.381 us

As you can see, the increased latency is significant. GPUWorker is not for you if your application depends on the best possible latency is such operations.

However, in more realistic situations (at least for my application) making thousands of ~10ms kernel calls in a row poses no performance penalty. Again, this test is on a single GPU of the Tesla D870

Standard realistic test

Time per step 11082.2 us

GPUWorker realistic test

Time per step 11080.6 us

In multiple runs, the delta on the time measurements is +/- 5us so the difference is in the noise.

Running the same realistic test on two peer type worker threads without GPUWorker gives the following timings: (This test uses both GPUs in the D870)

Peer-based mgpu test (GPU 0)

Peer-based mgpu test (GPU 1)

Time per step (GPU 0) 11081.8 us

Time per step (GPU 1) 11079.6 us

And running the realistic test on both GPUs using GPUWorker gives the following result:

Master/slave-based mgpu test

Time per step 11083 us

The conclusion is simple: In realistic situations with many contiguous asynchronous calls, there is no apparent performance penalty. If you want to see the full code of the benchmarks, look here:…

Sir, your kindness and ingenuity have put goosebumps on my wretched body. Thank you! This is very valuable, and is probably so for a great deal of people out there.

I would like to express the same feelings :)

Very nice indeed. One of my projects might benefit from this enormously. Hmmm, thinking a bit more about this, now I want to stuff as many CUDA cards into my machine as possible, so this is going to be an expensive library for my boss…

This functionality looks pretty useful. I wonder if it would be worthwhile to integrate this into the CuPP project. Given that it and HOOMD both use BSD-style licenses, it’s probably not an issue of logistics.

Thanks MisterAnderson for this excellent tool! I wrote a GRAPE6-like library, and wanted to extend it for multiGPU support. And standard way, as presented in SDK, is too awkward to implement; probably, I have similar problem as in HOOMD.

GPUWorker solved my problem! I just lose 5 GFLOP/s per GPU, and instead of 250, I am getting 245GFLOP/s, but it means from two I’ll get nearly 490!
=== UPDATE ===
I do not loose 5GFLOP/s per GPU. The C++ part of code was compiles with -O0 flags and compared against -O3 compiled code. After comparing apples with apples, i.e. -O3 with -O3, there is nearly no loss of performance!
=== UPDATE ===

Great job, very appreciated! Are there any papers where it is implemented, so that I could cite it in my paper which will be soon published?


Cool, I’m glad to hear it’s working out for you. I’ve got the single-gpu HOOMD switched over to GPUWorker as a first step and as you noticed in your code, there are no performance penalties when compiling with optimizations enabled.

Here is the reference for the HOOMD paper.
Journal of Computational Physics 227 (2008) 5342-5359

The GPUWorker class is really useful, but I have compatibility issues with Boost libraries.
Here on my system I can handle them just fine, but the final program will be run on a remote Tesla: my system is 32 bit, the remote one 64 bit, hence I cannot just move around the executables, I have to recompile the whole software.

Do you have a version w/o Boost (ie, using plain pthread) ?
I’ve looked at the code; it’s not long, I could convert it manually. The main problem would be to implement a replacement to Boost::bind.
May you help me, please?

Does your distribution have a 32-bit compatibility library for boost? I think that most redhat type distributions do, although I could be wrong.

For HOOMD, I just statically link the boost libraries for the distribution executable.

Before writing this, I did find a few alternatives to boost::bind that were almost as general. IIRC, the best phrase to search for was “function delegate”. Sorry, I don’t recall any specifics about which libraries I thought promising.

This is quite strange. I have no problem compiling my code, which uses GPUWorker, on both 32bit and 64bit systems. What kind of problems you’re running into while compiling your code on 64bit system?

Sorry, the problem is not on compiling.
I meant that I can compile it under my 32 bit system, but then I cannot run the executable on the 64bit system because there is no 32 bit CUDA installed (hence execution fails by lack of dynamic libraries needed).

Sorry, but I am not fully following you. I have both 32bit and 64bit systems available. On 32bit I installed both 32bit CUDA and libboost_trhead, and on 64bit I installed both 64bit CUDA, and libboost_thread. In this case, my code compiles & runs on both 64bit and 32bit.

What is your setup?

Yes, but I don’t have root privileges on the remote machine, and it only have 64 bit CUDA.
However I’ve just noticed it has boost installed, so I would be able to compile on that, too.

In my opinion, thought, it would be more useful to have a plain implementation of GPUWorker (I don’t usually code with Boost, and it take much time to me to download/compile/install it).
Actually a plain class, with just pthread and a plain implementation of Boost::bind would be more lightweight and easier to port to other systems.

My setups include 64-bit linux, 32-bit linux, windows XP 32, Mac OS X and Vista 64-bit. I statically link boost so that end users who download my executable don’t have to go through all the headaches of installing boost.

That is probably the best solution.

I agree, the boost requirement is the one drawback to GPUWorker. Installing boost can be difficult, even on systems that provide a package in the repository (i.e. ubuntu/debian require installing about a dozen different boost packages to get everything working, although gentoo just needs “emerge boost” :) ).

But I don’t agree that using pthreads will make it more portable. I need it to run on Windows too! So any solution that is to be as portable as the original must use both a cross-platform threading and a function delegate library. Boost has both (and I was already using boost) so I went with that. And boost is very portable across many platforms.

Feel free to re-implement the code with whatever libraries you prefer to use. I don’t have the time to do so myself.

Yes, statically linking is a great solution, I didn’t think that.

Actually, I didn’t mean pthreads are more portable (it is different to port pthread programs even on different UNIXes); I meant that the code would be easily modifiable to use another threading library.

However, thank you for your answer.

Excellent work!
Thank you very much for your contribution!

Thank you very much. It is very kind of you to provide this framework, your implementation is neat and robust, it will greatly reduce the trouble and debuging time for a thread novice like me.

BTW , i understand that GPU worker help us perform same function in parallel on different GPU, i wonder how to synchronize , or set the barrier in this framework, for example each GPU process the same image, at some point, we combine the result then send it back to GPU to continue the process. How can we do something like that. I don’t want to use thread join since it will destroy the GPU context

Just call sync() on each GPUWorker thread. That will ensure that all queued calls up to that point have been submitted to the GPU.

Submitted maybe, but have these calls returned yet?

Would I have to send a ThreadSynchronize() call to each of the workers to make sure they are actually done?


Of course. You can make things simpler then by using calling cudaThreadSynchronize with call() which has a built in call to sync().

I don’t understand why callAsyn() cause the race condition. For example

gpu.callAsync(bind(cudaMalloc(&d_array, n_bytes)));
gpu.callAsync(bind(cudaMemcpy(d_array, h_array, n_bytes, cudaMemcpyHostToDevice)));

Because the calls will be inserted to a queue, it preserve the order, and so that it will be executed on the same device with the same order as

cudaMalloc(&d_array, n_bytes);
cudaMemcpy(d_array, h_array, n_bytes, cudaMemcpyHostToDevice);

so how the order can be changed.

Thank you,

It maybe a stupid question but i’m new with thread programing, i want to know when i can use callAsync, when i can not, the call function serialize the operation seems not my choice in most cases