GPUWorker master/slave multi-gpu approach

Yeah, that race condition is tricky to see. What happens is that the pointer d_array isn’t set until the cudaMalloc is actually executed in the worker thread. The value of d_array is copied into the bind object when the bind object is created, so d_array won’t be set correctly in the memcpy unless there is a sync() after the cudaMalloc.

There are a number of other weird race conditions that can show up, although I can’t think of a specific example right now.

It’s a perfectly good question. I’ve been planning on updating the docs for GPUWorker for a while to answer exactly this question. I’ll try to take care of this sometime in the next week. For now, here is the short / not carefully written version:

To avoid these dangerous race conditions with callAsync(), call() should always be used unless the call needs to be asynchronous because you are going to queue up a call on another GPU. sync() should then be called on both GPUs immediately after. So, say you are going to copy memory and run kernels on two GPUs. callAsync() can be used like so:

gpu1.callAsync(bind(cudaMemcpy, d_array1, h_array1, n_bytes1, cudaMemcpyHostToDevice));

gpu2.callAsync(bind(cudaMemcpy, d_array2, h_array2, n_bytes2, cudaMemcpyHostToDevice));

gpu1.callAsync(bind(kernel, d_array1, params...));

gpu2.callAsync(bind(kernel, d_array2, params...));

gpu1.sync();

gpu2.sync();



Initialization code should always use call() to avoid race conditions, unless you really need parallel initialization and are very careful with sync().

Thank you for the answer. It looks clearer to me now. Normally I will group my calls in a function so it will run on one GPU exactly as it would with a single GPU, and I will use callAsync() to execute the function on each individual GPU, then synchronize the operations with the sync() functions.

For the example above it will look like

cudaError_t mallocAndCopyAndRun(void** d_data, void* h_data, int size) {
	cudaMalloc(d_data, size);
	cudaMemcpy(*d_data, h_data, size, cudaMemcpyHostToDevice);
	// ... launch the kernel on *d_data here ...
	return cudaGetLastError();
}

then in the main program

gpu1.callAsync(bind(mallocAndCopyAndRun, (void**)&d_data1, h_data1, size1));
gpu2.callAsync(bind(mallocAndCopyAndRun, (void**)&d_data2, h_data2, size2));
gpu1.sync();
gpu2.sync();


Is that the right and best way to do it?

Yep, that looks good.

It seems like the boost requirement comes into play only in the function bindings, so you can accept arbitrary cuda* functions/parameters on the stack. I can’t think of any other simple C++ library out there that would support such a construct. Does anyone know of anything that could be used as a replacement?


The keywords to search for are “function delegate library”. Many of them out there claim to be faster than boost, but have varying degrees of limitations. One promising one is “Fast C++ Delegate: Boost.Function ‘drop-in’ replacement and multicast”, although after looking at their syntax examples, it’s not obvious how it is a “drop-in” replacement.

I just managed to write my own simple bind function so that I don’t need Boost. It works on any C++ platform.

It works for “printf”… I tested up to 9 arguments.

I can give an outline of what I did (probably many C++ gurus already know of this)…

The idea is to write a set of template functions (each with the same name ‘bind’, one per argument count) like

template <typename F, typename A1, typename A2, typename A3>
AbstractBaseClass* bind(F f, A1 a1, A2 a2, A3 a3)
{
    .....
}


Inside this function, you need to create a new template class object. Since the compiler does automatic type inference, the “bind” function’s template parameters automatically get the correct types when the template is instantiated (so I don’t need to call bind<int (*)(int), int>(…) etc.).

The new template class object stores the function pointer and arguments in its constructor. This template class also overloads the () operator (the function-call operator), and is derived from an abstract base class that defines just a single pure virtual overload of the function-call operator.

The bind function returns a pointer to that abstract base class (with its one pure virtual function-call operator) to the caller. And that’s it… your bind is ready…

Well, this was exciting for me because I have just stepped into the C++ world… I thought it could be of some use to someone.

Good Luck

Best Regards,


MisterAnderson42, thank you for this great contribution. I’m in the process of wrapping my mind around GPUWorker… but do have a quick question:

I’ve been working with basic multiGPU code for a while. Unfortunately, I’ve discovered that spawning a CPU thread each time I want my GPUs to do work is very costly. What I therefore need to do is:

1. Spawn some CPU threads… one per GPU.
2. Each time the user triggers a particular event, the threads start their GPU kernels, and wait until such kernels finish.
3. Keep threads alive, until program exit.

Is this something that GPUWorker would handle?

Yep, that is exactly what GPUWorker is built to do. Except that it completely abstracts each “event” as a boost::bind function object, so you can call any function that returns cudaError_t in the worker thread. See the in-source documentation for examples.

I’ve been looking it over to see how this works. I assume the contents of kernel_caller are conceptually equivalent to what is in the solverThread method in the simpleMultiGPU SDK example?

Also, I inserted the source into the simpleMultiGPU project, but the VS 2005 linker outputs a series of LNK2019 errors like this:

1>GPUWorker.obj : error LNK2019: unresolved external symbol "public: void __cdecl boost::thread::join(void)" (?join@thread@boost@@QEAAXXZ) referenced in function "public: __cdecl GPUWorker::~GPUWorker(void)" (??1GPUWorker@@QEAA@XZ)



1>../../bin/win64/Debug/simpleMultiGPU.exe : fatal error LNK1120: 11 unresolved externals

Am I doing the linking against boost thread library wrong? The only linker option I’ve set is “C:\Program Files (x86)\boost\boost_1_35_0\lib” under Additional Library Directories.

kernel_caller is just a simple C function to call the kernel and return the resulting error code (if desired). I don’t know how this relates to the solver thread in the multi GPU sample code as I’ve never looked at that code. In any case, kernel_caller is not a thread, just a function that can be bound and passed to the worker thread queue to be called in that thread.

Enable the define BOOST_LIB_DIAGNOSTIC (i.e. compile with /DBOOST_LIB_DIAGNOSTIC) and see if it is properly auto-linking to the correct boost library (libboost_thread in this case).

If you are building on win64, you will also need to define BOOST_BIND_ENABLE_STDCALL (/DBOOST_BIND_ENABLE_STDCALL).

MisterAnderson42: thanks for the clarification on kernel_caller.

However, I’m not completely following you regarding BOOST_LIB_DIAGNOSTIC and BOOST_BIND_ENABLE_STDCALL. I’ve never used Boost before, and I don’t have much experience debugging linker problems. I recently asked for help on Experts Exchange, but the solution hasn’t been nailed down yet.

I am indeed trying to build on Vista64, using Boost v1.35.0. If working through this problem seems too cumbersome, do you have a simple GPUWorker VS project that correctly builds on Vista64?

Then you definitely need to define BOOST_BIND_ENABLE_STDCALL.

Sorry, I don’t have any simple VS project. The only app I have using GPUWorker is HOOMD, which it was written for.

This looks amazing. I’m going to try it this week with the NVIDIA PSC (4x Tesla C1060) and take some benchmarks of a massive matrix multiplication we’re doing (~65000 independent matrix muls, AxA^T, where A is 132x1)

I’m trying to play with this right now in my project. When I add the GPUworker to my MSVC project, I get an error:
function_base.hpp(625): error: <typeinfo> must be included before typeid is used

This error shows up for a bunch of files in the boost library. I see that typeinfo is being included in function_base.hpp, so I don’t understand the error. Has anybody else seen this or have a suggestion?


Nevermind… I’m an idiot. I was including the GPUWorker header in another header that was included in a source file being compiled by nvcc (i.e. nvcc was trying to compile Boost code, which is a known issue).

I am doing something similar to one of the earlier posters. I use a wrapper function to allocate my device memory.

cudaError_t AllocateAndInitializeFloatArray( float *&pTarget, const int &nSize )
{
	cudaMalloc( (void **)&pTarget, nSize*sizeof(float) );
	// Initialize the new array to zero.
	cudaMemset( pTarget, 0, nSize*sizeof(float) );
	return cudaGetLastError();
}

I call it from a class that contains all of my device arrays (one instance of the class is allocated per GPU).

m_pGPUWorker->call( boost::bind(AllocateAndInitializeFloatArray, m_SomeArray, nSize) );

This works correctly without GPUWorker, but the array is always NULL when I use GPUWorker. If I use cudaMalloc with GPUWorker, that works correctly. Any ideas of what I’m doing wrong?


After a little more poking around, I found that my wrapper function is being called. cudaMalloc is called, and the target array appears to be allocated correctly. When the code returns to the class, the pointer is NULL.

What you are doing looks OK at first glance, but…

float *&pTarget

Is that a pointer to a reference or a reference to a pointer? I’m not an expert in such intricacies of C++, so I’m honestly not sure exactly what cudaMalloc writes to when it modifies what is pointed to by the ** you pass it as &pTarget (the address of the caller’s pointer, or the address of whatever the reference happens to be bound to!?).

If you want my opinion: if you want to allocate memory like that, it is best just to pass a ** directly as an argument of AllocateAndInitializeFloatArray. I’m 100% positive that works.

It should be a reference to a pointer. I’ve double- and triple-checked to try to be absolutely certain. I’m going to try using a float **. That would be a perfectly good solution for me. =)

Switching to float ** definitely did the trick. Thanks!