CPU threads and CUDA

Dear All

I have implemented an image filtering function in C++ with CUDA, running on a GeForce 730M GPU in a 2 GHz i7 laptop.

It works well.

However, when I call it from multiple CPU threads, the processed image is sometimes incorrect: the output is correctly filtered, but it corresponds to the input image of a different thread.

However, all is well when the GPU call is protected with a mutex. From this I conclude that the CPU <-> GPU interface must go through a single CPU thread.

Is that correct, or am I doing something wrong?

Regards Martin

Bind the desired CUDA context to the new CPU thread first, likely with cuCtxSetCurrent().

More context management API:
http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html
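
For illustration, a minimal sketch of that pattern (error checking omitted for brevity; the worker's body is a placeholder):

[code]
// Sketch: create one context on the main thread, then make it current in each
// worker thread with cuCtxSetCurrent() before issuing any GPU work.
#include <cuda.h>
#include <thread>

CUcontext g_ctx;

void WorkerThread()
{
    cuCtxSetCurrent(g_ctx);         // bind the shared context to this thread
    // ... kernel launches, memory copies, etc. ...
}

int main()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);    // context created on the main thread

    std::thread t(WorkerThread);
    t.join();
    cuCtxDestroy(g_ctx);
    return 0;
}
[/code]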

I have to admit the question is not clear to me at all.

However, the step SPWorley suggests should not be necessary if the application is using the CUDA runtime API. The runtime API has mechanisms to ensure that all threads of a multithreaded application pick up the same CUDA context, per device.
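
For illustration, a sketch of what that means in practice (error checking omitted): each host thread simply calls cudaSetDevice() and thereby shares the device's primary context, with no explicit context management.

[code]
// Sketch: with the runtime API, every host thread that calls cudaSetDevice(0)
// operates in the same primary context for device 0.
#include <cuda_runtime.h>
#include <thread>

void WorkerThread()
{
    cudaSetDevice(0);                   // bind this thread to device 0's primary context
    unsigned char *d_buf;
    cudaMalloc((void **)&d_buf, 1024);  // per-thread allocation; no driver API calls needed
    // ... kernel launches, copies ...
    cudaFree(d_buf);
}

int main()
{
    std::thread t1(WorkerThread), t2(WorkerThread);
    t1.join();
    t2.join();
    return 0;
}
[/code]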

It is not correct that the GPU <-> CPU interface must be via a single CPU thread.

The problem is not diagnosable from the information given.

In response:

You are correct that I am probably omitting an important detail:

  1. My GPU application uses cudaSetDevice(0) to initialize the CPU-GPU interface, which I understand calls cuCtxSetCurrent() as part of setup.
  2. HOWEVER, the application uses persistent GPU memory that is allocated once at startup and reused for all subsequent calls across multiple threads! My intention was to improve throughput by avoiding the overhead of allocating and freeing memory, and initial testing showed it was logically correct and improved performance!

BUT I presume that this use of persistent GPU memory is inappropriate and is what leads to data being intermittently exchanged between threads, i.e. the mutex on the CPU side prevents the data from being scrambled.

I wouldn’t presume that.

You may have a race condition or some other issue. Rather than theorizing about suspected systemic issues, I would focus on debugging your code.

Further to what txbob said, concurrent host threads obviously have to use separate device memory to store each thread’s image to be processed.

If you are using the same memory for all host threads, this is the reason why threads sometimes overwrite each other’s memory.

If you just create a memory pool at startup, from which you assign separate memory to each host thread, check your pooling logic.
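
For example, a minimal pool along those lines might look like the following (a sketch only; the class name and fixed sizing are hypothetical):

[code]
// Hypothetical minimal device-buffer pool: each host thread checks out its own
// buffer, so no two threads ever work on the same device memory.
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class DeviceBufferPool {
public:
    DeviceBufferPool(size_t bytes, int count) {
        for (int i = 0; i < count; ++i) {
            unsigned char *p;
            cudaMalloc((void **)&p, bytes);
            free_.push_back(p);
        }
    }
    unsigned char *Acquire() {              // call from a host thread before its GPU work
        std::lock_guard<std::mutex> lock(m_);
        unsigned char *p = free_.back();    // assumes the pool holds at least one free buffer
        free_.pop_back();
        return p;
    }
    void Release(unsigned char *p) {        // return the buffer when the thread is done
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(p);
    }
private:
    std::mutex m_;
    std::vector<unsigned char *> free_;
};
[/code]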

Thanks for all comments. I attach the kernel invocation code below.

My original understanding (in error?) was that the GPU processes its work sequentially, and therefore multiple CPU threads calling a routine could co-exist using persistent GPU memory.

As per comments above, my current understanding is that to achieve maximum throughput I have the following options:

  1. Avoid the overhead of creating/freeing memory on the GPU and protect the persistent GPU memory with a mutex on the CPU side (essentially what I am doing now; see the sketch after this list)
  2. Create/free transient memory on the GPU on every call
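
For reference, the mutex protection I am currently using looks essentially like this (simplified sketch; ModeFilterLocked is just an illustrative name):

[code]
// Sketch of option 1: a single global mutex serializes all GPU calls, so the
// persistent d_img/d_dest buffers are never used by two threads at once.
#include <mutex>

unsigned char *ModeFilter(unsigned char *h_img, unsigned char *h_dest,
                          int width, int height, int radius, int bitshift); // as below

static std::mutex g_gpuMutex;

unsigned char *ModeFilterLocked(unsigned char *h_img, unsigned char *h_dest,
                                int width, int height, int radius, int bitshift)
{
    std::lock_guard<std::mutex> lock(g_gpuMutex);
    return ModeFilter(h_img, h_dest, width, height, radius, bitshift);
}
[/code]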

Or is there a flaw in the code?

CODE:
[code]
unsigned char *ModeFilter(unsigned char *h_img, unsigned char *h_dest,
                          int width, int height, int radius, int bitshift)
{
    int stride = 4;
    int Block_width = 16, Block_height = 16;
    int widthadj  = int(width  / stride) + 2;
    int heightadj = int(height / stride) + 2;
    static unsigned char *d_img, *d_dest;   // persistent device buffers, shared by all threads
    const dim3 grid(iDivUp(widthadj, Block_width), iDivUp(heightadj, Block_height), 1);
    const dim3 block(Block_width, Block_height, 1);

    cudaSetDevice(0);

    static bool START = true;
    if (START) {                            // allocate once, on the first call
        START = false;
        checkCudaErrors(cudaMalloc((void **)&d_img,  width * height * sizeof(unsigned char)));
        checkCudaErrors(cudaMalloc((void **)&d_dest, width * height * sizeof(unsigned char)));
    }

    checkCudaErrors(cudaDeviceSynchronize());

    // Load data
    checkCudaErrors(cudaMemcpy(d_img, h_img, sizeof(unsigned char) * width * height,
                               cudaMemcpyHostToDevice));

    ModeFilter_Kernel_Function<<<grid, block>>>(d_img, d_dest, width, height, radius,
                                                bitshift, stride);

    checkCudaErrors(cudaDeviceSynchronize());
    if (h_dest != NULL)
        checkCudaErrors(cudaMemcpy(h_dest, d_dest, sizeof(unsigned char) * width * height,
                                   cudaMemcpyDeviceToHost));

    return d_dest;
}
[/code]

The flaw is exactly what tera said: “If you are using the same memory for all host threads, this is the reason why threads sometimes overwrite each other’s memory.” With both device buffers shared, one thread’s cudaMemcpy into d_img can land between another thread’s upload and its device-to-host copy, which is exactly the “correct filter, wrong image” symptom you describe.

----> static unsigned char *d_img, *d_dest;

One could use the C++11 thread_local keyword here, for the flag START as well as for the pointers d_img and d_dest, to fix this with minimal changes.
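
As a sketch, the minimal change inside ModeFilter would be:

[code]
// Each host thread now gets its own flag and its own pair of device buffers,
// allocated on that thread's first call (replaces the corresponding lines above).
static thread_local bool START = true;
static thread_local unsigned char *d_img, *d_dest;

if (START) {
    START = false;
    checkCudaErrors(cudaMalloc((void **)&d_img,  width * height * sizeof(unsigned char)));
    checkCudaErrors(cudaMalloc((void **)&d_dest, width * height * sizeof(unsigned char)));
}
[/code]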

Christian