Unexplained GPU call overhead: one call 13 ms, N calls N*7 ms

My program is a nonlinear iterative solver. Here’s the pseudocode:

main()
{
    cpu_initialize();
    gpu_initialize();

    cpuTime = 0;
    gpuTime = 0;

    for (iter = 0; iter < 2500; iter++)
    {
        t1 = get_time();
        for (inner = 0; inner < 1; inner++)
            gpu_computations();
        t2 = get_time();

        cpu_computations();
        t3 = get_time();

        gpuTime += t2 - t1;
        cpuTime += t3 - t2;
    }
}

gpu_computations()
{
    // before - clear output buffer, copy input data to gpu
    cudaMemset(d_outputBuffer, 0, 12 MB);
    cudaMemcpy(d_inputBuffer, h_inputBuffer, 256 kB, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    // calculate
    gpu_kernel <<<64, 64>>> (arguments);
    cudaThreadSynchronize();

    // after - copy output data from gpu
    cudaMemcpy(h_outputBuffer, d_outputBuffer, 12 MB, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();
}

Each call to cpu_computations() takes around 20 ms, while each call to gpu_computations() takes around 13 ms. If I change the inner loop so that gpu_computations() is called multiple times (which does nothing useful, it just makes the GPU repeat the same work), each call to gpu_computations() drops to around 7 ms.

  1 gpu call  =  13 ms   average = 13 ms
  2 gpu calls =  14 ms   average =  7 ms
  3 gpu calls =  21 ms   average =  7 ms
  4 gpu calls =  28 ms   average =  7 ms
  5 gpu calls =  35 ms   average =  7 ms
100 gpu calls = 700 ms   average =  7 ms

Why does one call to gpu_computations() take 13 ms, while two or more successive calls take N*7 ms? I understand there is startup overhead with CUDA, but that should occur only at iter=0. It shouldn't show up in every iteration, and it disappears as soon as the function is called two or more times per iteration.

bandwidthTest reports 5.7-6.0 GB/s depending on the direction, using pinned memory as my program does. This is the only part of the program that uses the GPU, no other programs are using the GPU, and no X server is running. Linux, GTX 280, CUDA 1.1, latest driver.
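For reference, "pinned memory" here means the host buffers are allocated with cudaMallocHost, roughly like this (a minimal sketch; the element type and the lack of error checking are just for illustration, the sizes follow the pseudocode above):

    // Page-locked (pinned) host buffers, so cudaMemcpy can run at the
    // full PCIe bandwidth that bandwidthTest reports for pinned memory.
    float *h_inputBuffer;
    float *h_outputBuffer;
    cudaMallocHost((void **)&h_inputBuffer,  256 * 1024);        // 256 kB input
    cudaMallocHost((void **)&h_outputBuffer, 12 * 1024 * 1024);  // 12 MB output

    // ... run the solver ...

    cudaFreeHost(h_inputBuffer);
    cudaFreeHost(h_outputBuffer);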

I modified the code so I timed the before / calculate / after sections separately:

            1 call   2 or more calls
    before   2.1 ms      N*1.5 ms
    calc     8.6 ms      N*3.9 ms
    after    2.2 ms      N*1.8 ms
    --------------------------------
    total   12.8 ms      N*7.2 ms

There's some extra overhead in the memory copies to and from the GPU, but those times are close to what I expect from the bandwidth that bandwidthTest reports. Most of the overhead shows up in the kernel call; that's what I'm trying to understand.
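The per-section timing was added roughly like this (a sketch only; tA..tD and the three accumulators are placeholder names, and the existing cudaThreadSynchronize() calls are what make the CPU timestamps line up with the GPU work):

    gpu_computations()
    {
        tA = get_time();

        // before - clear output buffer, copy input data to gpu
        cudaMemset(d_outputBuffer, 0, 12 MB);
        cudaMemcpy(d_inputBuffer, h_inputBuffer, 256 kB, cudaMemcpyHostToDevice);
        cudaThreadSynchronize();
        tB = get_time();

        // calculate
        gpu_kernel <<<64, 64>>> (arguments);
        cudaThreadSynchronize();
        tC = get_time();

        // after - copy output data from gpu
        cudaMemcpy(h_outputBuffer, d_outputBuffer, 12 MB, cudaMemcpyDeviceToHost);
        cudaThreadSynchronize();
        tD = get_time();

        beforeTime += tB - tA;   // 2.1 ms for 1 call, N*1.5 ms otherwise
        calcTime   += tC - tB;   // 8.6 ms for 1 call, N*3.9 ms otherwise
        afterTime  += tD - tC;   // 2.2 ms for 1 call, N*1.8 ms otherwise
    }

As a rough sanity check against the 5.7-6.0 GB/s from bandwidthTest: the 12 MB device-to-host copy should take about 12 / 5.9 ≈ 2.0 ms, and the 256 kB host-to-device copy well under 0.1 ms, so the "before" and "after" numbers are in the right ballpark once cudaMemset and per-call overhead are added.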

Doesn’t initializing the GPU the first time always take longer? Are you calling a shutdown at any point inside

for (iter = 0; iter < 2500; iter++)
{
    t1 = get_time();
    for (inner = 0; inner < 1; inner++)
        // <-- is your init() somewhere around here?
        gpu_computations();
    t2 = get_time();

    // <-- are you calling a shutdown here?
    cpu_computations();
    t3 = get_time();

    gpuTime += t2 - t1;
    cpuTime += t3 - t2;
}

If you are calling a shutdown somewhere in there that isn't shown in the loop, then yes, the next call will take longer.

GPU startup and initialization is done in gpu_initialize(), before the loop. The only CUDA calls made inside the loop are the ones in gpu_computations(), as shown. I don't have an explicit GPU shutdown; I just let the program exit normally.

I just tested this with CUDA 2.0 Beta 2, and the same problem exists there.

The first time you run a kernel, there is some transfer of the microcode (or whatever it is called) for that kernel to the GPU. That is the overhead you are seeing.
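If that is the cause, a warm-up launch before the timed loop should make the first iteration look like the rest. A minimal sketch, assuming gpu_initialize() is the natural place for it:

    gpu_initialize()
    {
        // ... existing setup: cudaMalloc of d_inputBuffer / d_outputBuffer, etc. ...

        // Warm-up launch, so any one-time cost of loading the kernel onto the
        // device is paid here rather than inside the first timed iteration.
        gpu_kernel <<<64, 64>>> (arguments);
        cudaThreadSynchronize();
    }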

Rick,

a couple of comments:

  1. Make sure to always call cudaThreadSynchronize() before starting and stopping any CPU timer. If you don't, your measurements may include preceding CUDA calls, since a number of them are asynchronous from the CPU's point of view (see the loop sketch after the event-timing example below).

  2. Could you add CUDA event timing of your kernel (inside gpu_computations())? It could look something like:

cudaEvent_t start, stop;
float elapsed_time_ms = 0.0f;

cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start );
gpu_kernel <<<64, 64>>> (arguments);
cudaEventRecord( stop );

cudaThreadSynchronize( );
cudaEventElapsedTime( &elapsed_time_ms, start, stop );

Then see if the time for the first kernel and subsequent ones is different. Let us know what you observe, and we can go from there.
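For point 1, the pattern in the main loop would look roughly like this (a sketch only; in the code as posted, the cudaThreadSynchronize() at the end of gpu_computations() already covers the t2 timestamp):

    cudaThreadSynchronize();   // drain any CUDA work still pending from earlier
    t1 = get_time();
    for (inner = 0; inner < 1; inner++)
        gpu_computations();
    cudaThreadSynchronize();   // make sure all GPU work has finished before t2
    t2 = get_time();

    cpu_computations();
    t3 = get_time();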

Paulius

P.S. You could also run this through the profiler. Profiler GPU times are essentially measured with events.

I used cudaEventRecord(start, 0) since you didn't specify the second argument. I also called cudaEventDestroy(); otherwise the program crashes after 2048 event pairs. I should probably reuse the start and stop events, but I'm not concerned about wasting a little time on the CPU side while puzzling out what the GPU is doing.
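Concretely, the per-iteration event timing ended up roughly like this (a sketch; reusing one start/stop pair would just move the Create/Destroy calls outside the loop):

    cudaEvent_t start, stop;
    float et = 0.0f;

    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    gpu_kernel <<<64, 64>>> (arguments);
    cudaEventRecord( stop, 0 );

    cudaThreadSynchronize();
    cudaEventElapsedTime( &et, start, stop );
    printf("kernel time: %.2f ms\n", et);

    // Destroy the pair each iteration - without this the program
    // crashes after about 2048 event pairs.
    cudaEventDestroy( start );
    cudaEventDestroy( stop );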

I printed the time needed in ms for each call to gpu_kernel, with the GPU call repeated 1, 2, and 3 times. The very first call takes longer, as expected. Below are the average times for various groups of iterations.

                 GPU-1x  GPU-2x  GPU-3x
iter=1            4.36    4.36    4.47
iter=2..1649      2.42    2.42    2.42
iter=1650..2545   6.47    2.42    2.42

So something goes wrong starting around iteration 1650, but only when gpu_computations() is called once per iteration. If I run the program multiple times, the jump doesn't always start exactly at iteration 1650; it varies by 2-3 iterations. Once the time goes up, it never comes back down. The standard deviation of the times is very small, about 0.02 ms.

Closing comment: working with Nvidia, we traced it to a bug in the Linux driver that only showed up on my Dell T7400 workstation. It will be fixed in the next release of the driver.