Thanks for all the comments. I've attached the kernel invocation code below.
My original understanding (perhaps mistaken?) was that the GPU processes its work sequentially, and that multiple CPU threads calling this routine could therefore safely coexist while sharing the persistent GPU memory.
As per the comments above, my current understanding is that, to achieve maximum throughput, I have the following options:
- Avoid the overhead of creating/freeing memory on the GPU, and protect the persistent GPU memory with a mutex on the CPU side (first sketch below)
- Create/free transient memory on the GPU on every call (second sketch, after my code)
Or is there a flaw in the code?
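To make option 1 concrete, here is a minimal sketch of what I mean, assuming a C++11 std::mutex (the names g_gpu_mutex and ModeFilterLocked are my own, and error checking is omitted for brevity):

[code]
#include <mutex>
#include <cuda_runtime.h>

// Hypothetical guard for the persistent device buffers: only one CPU
// thread at a time may run the upload -> kernel -> download sequence.
static std::mutex g_gpu_mutex;
static unsigned char *d_img = nullptr, *d_dest = nullptr;

void ModeFilterLocked(const unsigned char *h_img, unsigned char *h_dest,
                      int width, int height)
{
    std::lock_guard<std::mutex> lock(g_gpu_mutex); // serialize CPU threads

    if (d_img == nullptr) { // allocate once, sized by the first caller
        cudaMalloc(&d_img, width * height);
        cudaMalloc(&d_dest, width * height);
    }

    cudaMemcpy(d_img, h_img, width * height, cudaMemcpyHostToDevice);
    // ... launch the filter kernel here ...
    cudaMemcpy(h_dest, d_dest, width * height, cudaMemcpyDeviceToHost);
} // mutex released here, once the result is safely back on the host
[/code]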
CODE:
[code]
unsigned char *ModeFilter(unsigned char *h_img, unsigned char *h_dest, int width, int height, int radius, int bitshift)
{
    int stride = 4;
    int widthadj = width / stride + 2;    // grid dimensions adjusted for the 4-pixel stride
    int heightadj = height / stride + 2;
    static unsigned char *d_img, *d_dest; // persistent device buffers, allocated once
    int Block_width = 16;
    int Block_height = 16;
    const dim3 grid(iDivUp(widthadj, Block_width), iDivUp(heightadj, Block_height), 1);
    const dim3 block(Block_width, Block_height, 1);
    cudaSetDevice(0);
    static bool START = true;             // one-time allocation flag
    if (START) {
        START = false;
        // Sized from the first call's width/height and never freed
        checkCudaErrors(cudaMalloc((void **)&d_img, width * height * sizeof(unsigned char)));
        checkCudaErrors(cudaMalloc((void **)&d_dest, width * height * sizeof(unsigned char)));
    }
    checkCudaErrors(cudaDeviceSynchronize());
    // Load data
    checkCudaErrors(cudaMemcpy(d_img, h_img, sizeof(unsigned char) * width * height, cudaMemcpyHostToDevice));
    ModeFilter_Kernel_Function<<<grid, block>>>(d_img, d_dest, width, height, radius, bitshift, stride);
    checkCudaErrors(cudaDeviceSynchronize());
    if (h_dest != NULL)
        checkCudaErrors(cudaMemcpy(h_dest, d_dest, sizeof(unsigned char) * width * height, cudaMemcpyDeviceToHost));
    return d_dest;
}
[/code]
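And a corresponding sketch of option 2, with transient buffers created and freed on every call (again my own illustration, error checking omitted; on CUDA 11.2+ I believe cudaMallocAsync/cudaFreeAsync on a stream could reduce the allocation overhead, though I have not measured it):

[code]
void ModeFilterTransient(const unsigned char *h_img, unsigned char *h_dest,
                         int width, int height)
{
    unsigned char *d_img = nullptr, *d_dest = nullptr;
    size_t bytes = (size_t)width * height; // one byte per pixel

    cudaMalloc(&d_img, bytes);   // fresh buffers on every call, so there is
    cudaMalloc(&d_dest, bytes);  // no state shared between CPU threads

    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);
    // ... launch the filter kernel here ...
    cudaMemcpy(h_dest, d_dest, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_img);             // pay the alloc/free cost on every call
    cudaFree(d_dest);
}
[/code]

Note that this variant has to copy the result back into h_dest before returning; it cannot hand back the device pointer the way my code above does, because the buffer is freed at the end of the call.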