pthread and CUDA a problem of using pthread with CUDA


I came across a problem on using pthread with CUDA 1.0. Below is the
outline of a toy program I used. It does C=A+B, where A,B,C are three
matrixes of the same size.

void Work(){
// Load A and B to device

// Create a pthread to do the addition
pthread_t threads;
pthread_create(&threads, NULL, Add_, NULL);



// The addition function on host
void * Add_ (void * f){
// compute block and grid dimension

 // Launch the device computation; Ad, Bd correspond to                                                                                                                                                        
 A and B on the device, and results stored to Cd                                                                                                                                                               
 Addd<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, matrixWidth);

 // Copy Cd to host
 cudaMemcpy(C, Cd, matrixSize, cudaMemcpyDeviceToHost);

 // !!!! When USE_THREAD is defined, C always has zeros !!!!
 //      Otherwise, C contains correct results.


In the above code, if “USE_THREAD” is defined, C contains just zeros
(i.e., its initial contents); otherwise, C has the correct addition

Using emulator, I saw that Cd contains correct addition results no
matter “USE_THREAD” is defined or not. So, why cudaMemcpy() failed to
copy Cd to C when “USE_THREAD” is defined?

(The complete code I used is attached.)

  • I’m using CUDA 1.0 on a GeForce 8800GTX (SUSE Linux x86_64). (2.48 KB)

You should do all the CUDA stuff from one thread, so indeed, this doesn’t work. Different threads get different CUDA contexts which cannot share data at the moment.

I’m not quite sure about “CUDA contexts”. First, I would think that when you said “Different threads”, you are referring to CPU threads (instead of GPU threads). In my code, the array “Ad” is allocated on the device by thread I (parent thread), and thread II (child thread) uses the array for addition. Since “Cd” contains the correct addition results after thread II finishes its computation on GPU, it seems the two threads can reference the same array “Ad” (and “Bd”) on GPU successfully.

(I saw that “Cd” contains the correct results through emulator, which I hope is consistent with the real results.)

My problem is that, with pthreads, “cudaMemcpy” by thread II cannot copy “Cd” (on GPU) to “C” (on host). Can you elaborate on why “CUDA contexts” would cause this issue?

Wumpus was indeed speaking about CPU threads.
Cd does not contain the correct results when running on the GPU since there will be different cuda contexts for the two threads. Emulation is just that: emulation (on the cpu). Some things will work in emulation, but not on the device.

You cannot use Ad, Bd and Cd in a thread that did not CudaMalloc them, each thread has its own context and cannot exchange device-pointers with other threads.

The different contexts seem to be the reason. Emulation is deceiving in this case.

Thanks for the replies!

Emulation can be deceiving in more ways, be warned. You can also use host-pointers in kernels in emulation mode, but it will give you an unspecified launch failure when running on the device.

One of the many ways to spend a nice couple of hours/days ;)

Anyone care to give a vague timeframe on when we might get the ability to transfer contexts between threads? This was a real show-stopper for us. Not being able to transfer contexts makes it very difficult to build a CUDA-based library that is agnostic with respect to the client’s threading model.

That is, if we could transfer contexts, we certainly can be sure to acquire and release locks appropriately to make sure that only one thread is using the context at a time. But if we can’t transfer contexts, then we need to either (a) tell the client to only make API calls to our library with one thread or (b) run a daemon thread ourselves and communicate with that. Both create substantial problems, especially on a multiprocessor system. I’d much rather move a small CUDA context between a couple processors than a couple megs of client data.

Maybe this would make a nice entry into the wishlist post?

Already done, Denis (I believe from the pre-1.1 release bit of the wishlist). I’m afraid I’m a bit of a broken record on that issue.

I did this too, and it turned out to work very badly. This thread will spend most of its time busy-waiting (spinlooping) for the GPU, wasting a lot of CPU cycles. I solved this by making the GPU thread one of the CPU threads as well, thereby overlapping CPU and GPU processing. But it’s hardly a general solution.

NVidia should really put in some kind of interrupt mechanism, so that waiting for the GPU is blocking (using select() or poll() or so). Then you could have a almost fully idle GPU thread, that only sends commands to the GPU. At the moment, it’s a very bad idea.

CUDA contexts will be able to be ported between CPU threads in the next version (driver API clients only). Libraries will be able to create CUDA contexts and attach/detach them to the CPU thread that calls into them.

The main difference between this approach and the current workaround (delegating to a worker thread that owns the CUDA context) is that no CPU thread context switch will be performed.