Kernel Runtime

I have a question about runtime progress on the CPU. When I call a kernel and am waiting for the results from the GPU, can you tell me what the CPU is doing while the GPU is processing the data?

For example:

… // CODE CONFIGURATION TO CALL CUDA

// calling the kernel: what does the CPU do here?
kernel<<<grid, block>>>(…);

// transferring the data after we have finished processing
cudaMemcpy(…);

This call returns immediately on the CPU, allowing the GPU to work in the background. You can do any CPU-side calculations you want after the kernel call.

However, as soon as you call cudaMemcpy(), the CPU will wait for the GPU to finish before copying memory. (That way you don’t get incomplete results.) It is important to note that whenever a CUDA call is waiting for the GPU to finish, the CPU sits in a “hot spin” continuously checking whether the GPU is finished. You will see your CUDA program using 100% CPU in this case. The hot spin, while wasting CPU, minimizes latency between GPU functions, which improves the performance when you are making lots of short calls.
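As a rough illustration of that behavior (the kernel, the sizes, and the doCpuWork() helper below are placeholders, not code from this thread):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = i * 2.0f;          // stand-in for real GPU work
}

void doCpuWork() { /* independent CPU-side work */ }

int main()
{
    const int n = 1 << 20;
    float *h_out = (float *)malloc(n * sizeof(float));
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // The launch is asynchronous: control returns to the CPU immediately.
    kernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // The CPU is free to do unrelated work while the GPU runs the kernel.
    doCpuWork();

    // cudaMemcpy waits for the kernel to finish before copying, so
    // h_out never contains partial results (this is where the hot spin happens).
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_out[10] = %f\n", h_out[10]);
    cudaFree(d_out);
    free(h_out);
    return 0;
}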

So, how can I get a precise measurement of the processing time on the GPU (without the memory transfer)?

Use cudaThreadSynchronize() to make the CPU wait for the GPU to finish, and then do the cudaMemcpy.
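For example, a rough sketch on Linux (the kernel body and sizes are placeholders; on Windows you would swap gettimeofday for another timer):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = sqrtf((float)i);   // stand-in for real GPU work
}

int main()
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    struct timeval start, stop;
    gettimeofday(&start, NULL);

    kernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // Block the CPU until the kernel is done, so the timer covers
    // only kernel execution and no memory transfer.
    cudaThreadSynchronize();

    gettimeofday(&stop, NULL);
    double us = (stop.tv_sec - start.tv_sec) * 1e6 + (stop.tv_usec - start.tv_usec);
    printf("kernel time: %f us\n", us);

    cudaFree(d_out);
    return 0;
}

cudaEventRecord()/cudaEventElapsedTime() is another option if you would rather have the GPU do the timing.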

Have you guys tested the idea that a kernel call returns before the kernel is finished? Perhaps it’s some oddity with my setup, but I’ve tried everything I can think of to get it to return control to the CPU before the kernel finishes, with no success so far.

I’ve tested it, and it works for up to 16 kernel calls on compute 1.0 devices and up to 24 kernel calls on compute 1.1 devices.

Are you also using the cuda profiler at the same time you are testing? Enabling the profiler puts an implicit cudaThreadSynchronize() after every kernel call.

No, just running a standard release build without the profiler.

So how are you verifying that it does this? I’ve tried doing something basic like:

__global__ void loop()
{
    while(1);
}

...inside a function...

loop<<<BLOCK_COUNT, THREAD_COUNT>>>();
printf("Past\n");
fflush(stdout);

The screen only prints “Past” after the watchdog timer kills the kernel.

Are you running CUDA on your display card? If so, it’s probably just that the display isn’t updating while the kernel is running (but it is actually “printed” immediately after the kernel is launched).

I tested it with the attached code on Linux. You'll need to replace the gettimeofday call with a different timer on Windows. It calls the kernel 32 times and records the time after each kernel call.
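The attached file is the actual test; roughly, the measurement loop looks something like the sketch below (the busy-wait kernel and the timer math here are illustrative, not copied from the attachment):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// A kernel that burns a roughly fixed number of SM cycles so each
// launch takes a measurable amount of time.
__global__ void busy()
{
    clock_t start = clock();
    while (clock() - start < 1000000)
        ;
}

int main()
{
    struct timeval t0, t;
    gettimeofday(&t0, NULL);

    for (int i = 1; i <= 32; i++) {
        busy<<<1, 1>>>();               // asynchronous until the queue fills up
        gettimeofday(&t, NULL);
        double us = (t.tv_sec - t0.tv_sec) * 1e6 + (t.tv_usec - t0.tv_usec);
        printf("i = %d / Time: %f us\n", i, us);
    }

    cudaThreadSynchronize();            // drain the queued kernels before exiting
    return 0;
}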

Here are the results. There is a small launch overhead for each kernel call, but notice the large jump in time at the 17th call: since the queue is only 16 deep, the 17th launch has to wait for the first queued kernel to finish executing.

i = 1 / Time: 15.000000 us
i = 2 / Time: 29.000000 us
i = 3 / Time: 40.000000 us
i = 4 / Time: 52.000000 us
i = 5 / Time: 63.000000 us
i = 6 / Time: 75.000000 us
i = 7 / Time: 86.000000 us
i = 8 / Time: 97.000000 us
i = 9 / Time: 108.000000 us
i = 10 / Time: 120.000000 us
i = 11 / Time: 131.000000 us
i = 12 / Time: 142.000000 us
i = 13 / Time: 153.000000 us
i = 14 / Time: 165.000000 us
i = 15 / Time: 176.000000 us
i = 16 / Time: 187.000000 us
i = 17 / Time: 10399.000000 us
i = 18 / Time: 20736.000000 us
i = 19 / Time: 31082.000000 us
i = 20 / Time: 41454.000000 us
i = 21 / Time: 51819.000000 us
i = 22 / Time: 62181.000000 us
i = 23 / Time: 72520.000000 us
i = 24 / Time: 82859.000000 us
i = 25 / Time: 93210.000000 us
i = 26 / Time: 103538.000000 us
i = 27 / Time: 113922.000000 us
i = 28 / Time: 124269.000000 us
i = 29 / Time: 134645.000000 us
i = 30 / Time: 144996.000000 us
i = 31 / Time: 155375.000000 us
i = 32 / Time: 165742.000000 us

async_test.cu.txt (876 Bytes)

You’re exactly right, that’s what it was. I used the time function instead to measure the difference and saw that it was in fact asynchronous.