Kernel Runtime

I have a question about runtime progress on the CPU. When I call a kernel and am waiting for the results from the GPU, can you tell me what the CPU is doing while the GPU is processing the data?

For example:

… // CODE CONFIGURATION TO CALL CUDA

// calling the kernel: what does the CPU do here?
kernel<<<grid, block>>>(…);

// transferring the data after we have finished processing
cudaMemcpy(…);

This call returns immediately on the CPU, allowing the GPU to work in the background. You can do any CPU-side calculations you want after the kernel call.

However, as soon as you call cudaMemcpy(), the CPU will wait for the GPU to finish before copying memory. (That way you don’t get incomplete results.) It is important to note that whenever a CUDA call is waiting for the GPU to finish, the CPU sits in a “hot spin” continuously checking whether the GPU is finished. You will see your CUDA program using 100% CPU in this case. The hot spin, while wasting CPU, minimizes latency between GPU functions, which improves the performance when you are making lots of short calls.
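As a rough illustration of that behavior (the kernel, the sizes, and the doCpuWork() helper below are placeholders, not code from this thread):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = i * 2.0f;          // stand-in for real GPU work
}

void doCpuWork() { /* independent CPU-side work */ }

int main()
{
    const int n = 1 << 20;
    float *h_out = (float *)malloc(n * sizeof(float));
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // The launch is asynchronous: control returns to the CPU immediately.
    kernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // The CPU is free to do unrelated work while the GPU runs the kernel.
    doCpuWork();

    // cudaMemcpy waits for the kernel to finish before copying, so
    // h_out never contains partial results (this is where the hot spin happens).
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_out[10] = %f\n", h_out[10]);
    cudaFree(d_out);
    free(h_out);
    return 0;
}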

So, how can I get a precise measurement of the processing time on the GPU (without the memory transfer)?

Use cudaThreadSynchronize() to make the CPU wait for the GPU to finish, and then do the cudaMemcpy.
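For example, a rough sketch on Linux (the kernel body and sizes are placeholders; on Windows you would swap gettimeofday for another timer):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = sqrtf((float)i);   // stand-in for real GPU work
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    struct timeval start, stop;
    gettimeofday(&start, NULL);

    kernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // Block the CPU until the kernel is done, so the timer covers
    // only kernel execution and no memory transfer.
    cudaThreadSynchronize();

    gettimeofday(&stop, NULL);
    double us = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_usec - start.tv_usec);
    printf("kernel time: %f us\n", us);

    cudaFree(d_out);
    return 0;
}

cudaEventRecord()/cudaEventElapsedTime() is another option if you would rather have the GPU do the timing.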

Have you guys tested the idea that a kernel call returns before the kernel is finished? Perhaps it’s some oddity with my setup, but I’ve tried everything I can think of to get it to return control to the CPU before the kernel finishes, with no success so far.

I’ve tested it, and it works for up to 16 kernel calls on compute 1.0 devices and up to 24 kernel calls on compute 1.1 devices.

Are you also using the cuda profiler at the same time you are testing? Enabling the profiler puts an implicit cudaThreadSynchronize() after every kernel call.

No, just running a standard release build without the profiler.

So how are you verifying that it does this? I’ve tried doing something basic like:

__global__ void loop()
{
    while(1);
}

...inside a function...

loop<<<BLOCK_COUNT, THREAD_COUNT>>>();
printf("Past\n");
fflush(stdout);

The screen only prints “Past” after the watchdog timer kills the kernel.

Are you running CUDA on your display card? If so, it’s probably just that the display isn’t updating while the kernel is running (but it is actually “printed” immediately after the kernel is launched).

I tested it with the attached code on Linux. You'll need to replace the gettimeofday call with a different timer on Windows. It calls the kernel 32 times and records the time after each kernel call.
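The attached file is the actual test; roughly, the measurement loop looks something like the sketch below (the busy-wait kernel and the timer math here are illustrative, not copied from the attachment):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// A kernel that burns a roughly fixed number of SM cycles so each
// launch takes a measurable amount of time.
__global__ void busy()
{
    clock_t start = clock();
    while (clock() - start < 1000000)
        ;
}

int main()
{
    struct timeval t0, t;
    gettimeofday(&t0, NULL);

    for (int i = 1; i <= 32; i++) {
        busy<<<1, 1>>>();               // asynchronous until the queue fills up
        gettimeofday(&t, NULL);
        double us = (t.tv_sec - t0.tv_sec) * 1e6 + (t.tv_usec - t0.tv_usec);
        printf("i = %d / Time: %f us\n", i, us);
    }

    cudaThreadSynchronize();            // drain the queued kernels before exiting
    return 0;
}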

Here are the results. There is a small launch overhead for each kernel call, but notice the large jump in time at the 17th call: since the queue is only 16 deep, the 17th launch has to wait for the first queued kernel to finish executing.

i = 1 / Time: 15.000000 us
i = 2 / Time: 29.000000 us
i = 3 / Time: 40.000000 us
i = 4 / Time: 52.000000 us
i = 5 / Time: 63.000000 us
i = 6 / Time: 75.000000 us
i = 7 / Time: 86.000000 us
i = 8 / Time: 97.000000 us
i = 9 / Time: 108.000000 us
i = 10 / Time: 120.000000 us
i = 11 / Time: 131.000000 us
i = 12 / Time: 142.000000 us
i = 13 / Time: 153.000000 us
i = 14 / Time: 165.000000 us
i = 15 / Time: 176.000000 us
i = 16 / Time: 187.000000 us
i = 17 / Time: 10399.000000 us
i = 18 / Time: 20736.000000 us
i = 19 / Time: 31082.000000 us
i = 20 / Time: 41454.000000 us
i = 21 / Time: 51819.000000 us
i = 22 / Time: 62181.000000 us
i = 23 / Time: 72520.000000 us
i = 24 / Time: 82859.000000 us
i = 25 / Time: 93210.000000 us
i = 26 / Time: 103538.000000 us
i = 27 / Time: 113922.000000 us
i = 28 / Time: 124269.000000 us
i = 29 / Time: 134645.000000 us
i = 30 / Time: 144996.000000 us
i = 31 / Time: 155375.000000 us
i = 32 / Time: 165742.000000 us

async_test.cu.txt (876 Bytes)

You’re exactly right, that’s what it was. I used the time function instead to measure the difference and saw that it was in fact asynchronous.