CUDA thread in background?

I’m a PhD student in Computer Vision, and I’m in the process of converting pure C++ image processing programs into C++/CUDA.

I’m facing extreme difficulty, mainly in parallelising the programs. Perhaps my idea of the whole thing is a little off, but I assume that when random access to any location in an image is required within a CUDA block, it is quicker to run on a multicore CPU with a fast clock? I do notice, though, that although my (probably poorly written) GPU programs perform much slower than the CPU in these cases, using the CPU causes one of the cores to max out and another to go halfway, and the notebook fan runs at full speed. I’m scared to run programs like this for extended periods in case I burn some circuitry inside.

I’m still struggling to understand how best to organise blocks and threads, but meanwhile, running deviceQuery tells me the max block dimensions are 512x512x64. Does that mean I can load an entire 512x512x3 (channel) RGB image within a single block? I experimented with running on a single block and my program crashed, so clearly I did something terribly wrong.

Is there a way to execute a CUDA kernel in the background so that the main C++ program can do other things while waiting for the CUDA results? If that’s possible, I can squeeze productivity even out of my slow GPU programs, simply by getting the CPU to accomplish other tasks while the GPU’s at it.

Lastly, could anyone point me in the right direction for organising blocks and threads if I want to calculate, for instance, the Euclidean distance between the local histogram in a 3x3 window centred on each pixel and the entire image histogram, for a 3-channel image?


Or better yet: how would one most efficiently compute a complete distance matrix for the entire image?
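For reference, here is a rough serial sketch of the single-channel case I’m trying to parallelise (the function name, the bin count, and the uniform normalisation are just placeholder choices, not my actual code):

```cpp
#include <cmath>
#include <vector>

// Per-pixel Euclidean distance between the histogram of a 3x3 window
// and the whole-image histogram, single channel for simplicity.
const int BINS = 16;  // placeholder bin count

std::vector<float> localVsGlobalDistance(const unsigned char* img, int w, int h)
{
    // Global histogram, normalised to sum to 1.
    float global[BINS] = {0};
    for (int i = 0; i < w * h; ++i)
        global[img[i] * BINS / 256] += 1.0f / (w * h);

    std::vector<float> dist(w * h, 0.0f);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            // Local histogram of the 3x3 window, also normalised to 1.
            float local[BINS] = {0};
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    local[img[(y + dy) * w + (x + dx)] * BINS / 256] += 1.0f / 9.0f;
            // Euclidean distance between the two histograms.
            float d2 = 0.0f;
            for (int b = 0; b < BINS; ++b) {
                float diff = local[b] - global[b];
                d2 += diff * diff;
            }
            dist[y * w + x] = std::sqrt(d2);
        }
    }
    return dist;
}
```

The question is essentially how to map the two outer loops onto blocks and threads.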

There is a limit of 512 total threads per block on current hardware, with a maximum dimension size of 512 for x and y and 64 for z. So you can have 512x1x1, or 1x512x1, or 32x1x16, or 1x32x16, etc.: 512 threads per block in total. The 512x512x64 figure from deviceQuery is the per-dimension maximum, not a block size you can actually launch.
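As a concrete illustration (kernel and variable names are placeholders), the usual pattern for images is a small 2D block and enough blocks to tile the whole image:

```cpp
// 16x16 = 256 threads per block, comfortably under the 512-thread limit.
dim3 block(16, 16, 1);
// Round up so the grid covers the entire width x height image.
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y, 1);
processImage<<<grid, block>>>(d_image, width, height);  // hypothetical kernel
```

Each thread then computes its pixel coordinates from blockIdx, blockDim and threadIdx, and guards against reading past the image edge.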

That is the standard model. CUDA kernel launches are non-blocking for the host, so efficient code can and should overlap device computation with host computation.
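In outline (the kernel and the host-side work are placeholders), the overlap looks like this:

```cpp
myKernel<<<grid, block>>>(d_in, d_out);  // returns to the host immediately
doOtherHostWork();                       // runs while the GPU computes
// cudaMemcpy implicitly waits for the kernel to finish before copying:
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
```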

I get a speedup of 2.5 even for random accesses with CUDA. It’s not that much, but that is comparing an OpenMP implementation on a Core i7 processor against a Quadro FX 4800.

Thanks. So it is possible to process an entire small image within a single block, but is doing so advisable? My only motivation for thinking along those lines is faster random pixel read/write.

I’m very, very confused on this one. Doesn’t the host hang anyway until the device function call returns? For instance, if in the host function I use:

prepare <<< n_blocks, block_size >>> (d_eB, d_eG, d_eR);

then you are saying the host function proceeds past ‘prepare’ immediately after calling it? If the GPU and CPU control flows proceed simultaneously, at what point does the CPU know to retrieve the GPU results?

Then my code must have some serious flaws. Could you give me a quick idea of how best to organise threads and blocks for fast image processing? Are you using texture fetches or shared memory (or global, for that matter)?

That is how it works. The standard copy operation (cudaMemcpy in the runtime API) is blocking, so if your host code reaches a copy while the GPU is still running a kernel, the host will spin until the kernel execution finishes and then perform the copy. The runtime API also includes an explicit host-GPU synchronisation call (cudaThreadSynchronize) which you can call if required. The streams functionality in the runtime API takes the model even further, allowing multiple command streams to the same GPU, and includes asynchronous copying and per-stream synchronisation. This is discussed in the programming guide, and there is an SDK example (simpleStreams, IIRC) which demonstrates how streams and copy/execution overlap work using the asynchronous versions of the standard API functions (cudaMemcpyAsync, cudaMemcpyToSymbolAsync).
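A minimal streams sketch (the kernel, buffer names, and sizes are placeholders):

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Asynchronous copies require page-locked ("pinned") host memory.
unsigned char *h_buf, *d_buf;
cudaMallocHost((void**)&h_buf, bytes);
cudaMalloc((void**)&d_buf, bytes);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_buf);  // queued behind the copy in the same stream
// ...independent host work here runs concurrently...
cudaStreamSynchronize(stream);                // wait for this stream only
```

Operations within one stream execute in order, so the kernel is guaranteed to see the completed copy; the host is free until the explicit synchronisation.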

Wow, thanks, that certainly clarifies it. So I understand that the way to go is to call cudaMemcpy at the last possible moment before the CUDA results are actually needed by the host. One of my mistakes was calling cudaMemcpy immediately after launching the kernel; no wonder the performance was so lousy.

I assume this also means that a while(1){} in a kernel would lock up GPU resources, rendering all other computation effort futile (if the kernel is large enough) until a system restart.

The current hardware can only run one kernel at a time, and it can only overlap execution and copying, so the first part of your hypothesis is correct by default. But the second isn’t. The executing kernel only lives as long as the GPU context to which it is attached. If your application closes the context (or exits, which has the same effect), the kernel is terminated. There is also an operating-system watchdog timer which prevents any kernel from monopolising a GPU with an attached display for more than a few seconds; any kernel which doesn’t yield in that time gets terminated by the driver (this doesn’t apply to dedicated compute cards). So a kernel with an infinite loop can be stopped from user space without a system restart, and there are protections in the driver to guarantee that it should not (at least in theory) be possible for the OS to lose control of the GPU.

Nice, thanks.