CUDA thread in background?

I’m a PhD student in Computer Vision, and I’m in the process of converting pure C++ image processing programs into C++/CUDA.

My main difficulty is in parallelising the programs. Perhaps my idea of the whole thing is a little off, but I assume that when random access to arbitrary locations in an image is required within a CUDA block, it is quicker to run the code on a multicore CPU with a fast clock? What I do notice, though, is that while my (probably poorly written) GPU programs perform much slower than the CPU in these cases, running on the CPU maxes out one core and pushes another to about half load, and the notebook fan goes to full speed. I’m scared to run programs like this for extended periods in case I burn out some circuitry.

I’m still struggling to understand how best to organise blocks and threads, but meanwhile, running deviceQuery tells me the max block size is 512x512x64. Does that mean I can load an entire 512x512x3 (channel) RGB image within a single block? I experimented with running on a single block and my program crashed, so clearly I did something terribly wrong.

Is there a way to execute a CUDA kernel in the background so that the main C++ program can continue doing other things until the CUDA results are needed? If that’s possible, then I can squeeze productivity even out of one of my slow GPU programs, simply by having the CPU accomplish other tasks while the GPU’s at it.

Lastly, could anyone point me in the right direction on how to organise blocks and threads if I want to calculate, for instance, the Euclidean distance between the local histogram in a 3x3 window centered on each pixel and the entire image histogram, for a 3-channel image?

Best,
G.

Or better yet, how to compute a complete distance matrix for the entire image most efficiently.

There is a limit of 512 total threads per block on current hardware, with a maximum dimension size of 512 for x and y and 64 for z. So you can have 512x1x1, or 1x512x1, or 32x1x16, or 1x32x16, etc., but never more than 512 threads per block in total.
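As a rough illustration (the kernel name and image pointer below are just placeholders, not from your code), a 512x512 image would normally be covered by a grid of many blocks rather than a single block, e.g. 16x16 threads per block (256 threads, under the 512 limit) with a 32x32 grid, one thread per pixel:

dim3 block(16, 16, 1);                       // 256 threads per block, well under the 512 limit
dim3 grid(512 / block.x, 512 / block.y, 1);  // 32 x 32 = 1024 blocks covering the image
myKernel<<<grid, block>>>(d_image);          // each thread handles one pixel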

That is the standard model. CUDA kernel launches are non-blocking for the host. Efficient code can and should overlap device computation with host computation.
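A rough sketch of what that looks like (the kernel and buffer names here are made up for illustration):

someKernel<<<grid, block>>>(d_data);                         // launch returns to the host immediately
doUnrelatedCpuWork();                                        // CPU work runs while the GPU computes
cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);  // blocks until the kernel has finished, then copies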

I get a speedup of about 2.5x even for random accesses with CUDA. It’s not that much, but that is comparing an OpenMP implementation on a Core i7 processor against a Quadro FX 4800.

Thanks. So it is possible to process an entire small image within a single block, but is doing this advisable? My only motivation for thinking along those lines is faster random pixel read/write.

I’m very confused on this one. Doesn’t the host hang anyway until the device function call returns something? For instance, if in the host function I use:

prepare <<< n_blocks, block_size >>> (d_eB, d_eG, d_eR);

then you are saying the host function proceeds past ‘prepare’ immediately after calling it? If GPU and CPU control flow proceed simultaneously, at what point does the CPU know to retrieve the GPU results?

Then my code must have some serious flaws. Would you be able to give me a quick idea of how best to organise threads and blocks for fast image processing? Are you using texture fetches or shared memory (or global memory, for that matter)?

That is how it works. The standard copy operation (cudaMemcpy in the runtime API) is blocking. So if your host code reaches a copy while the GPU is still running a kernel, the host will spin-wait until the kernel execution finishes and then perform the copy. The runtime API also includes an explicit host-GPU synchronization call (cudaThreadSynchronize) which you can call if required.

The runtime API’s streams functionality takes the model even further, allowing multiple command streams to the same GPU, and includes asynchronous copying and per-stream synchronization. This is discussed in the programming guide, and there is an SDK example (simpleStreams IIRC) which demonstrates how streams and copy/execution overlap can work using the asynchronous versions of the standard API functions like cudaMemcpy and cudaMemcpyToSymbol.
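A minimal sketch of the stream-based version, using the kernel from your snippet above (the buffer names and sizes are placeholders; simpleStreams in the SDK is the complete reference):

cudaStream_t stream;
cudaStreamCreate(&stream);

// For the copies to be truly asynchronous, h_in/h_out must be page-locked (allocated with cudaMallocHost).
cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, stream);
prepare<<<n_blocks, block_size, 0, stream>>>(d_eB, d_eG, d_eR);   // kernel queued in the same stream
doOtherHostWork();                                                // CPU keeps working meanwhile
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // block only when the results are actually needed
cudaStreamDestroy(stream);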

Wow, thanks, that certainly clarifies it. So I understand that the way to go is calling cudaMemcpy at the last possible moment before the CUDA data is actually used by the host. One of my mistakes was to call cudaMemcpy immediately after launching the kernel. No wonder the performance was so lousy.

I assume this also means that a while(1){} in the kernel would lock up GPU resources, rendering all other computation effort futile if the kernel is large enough, until a system restart.

The current hardware can only run one kernel at a time, and it can only overlap execution and copying, so the first part of your hypothesis is correct by default. But the second part isn’t. An executing kernel only lives as long as the GPU context to which it is attached. If your application closes the context (or exits, which has the same effect), the kernel should be terminated. There is also an operating system watchdog timer which prevents any kernel from monopolizing a GPU with an attached display for more than a few seconds; any kernel which doesn’t yield in that time gets terminated by the driver (this doesn’t apply to dedicated compute cards). So a kernel with an infinite loop can be stopped from user space without a system restart, and there are certain protections in the driver to guarantee that it should not (at least in theory) be possible for the OS to lose control of the GPU.

Nice, thanks