My program is a nonlinear iterative solver. Here’s the pseudocode:
main()
{
    cpu_initialize();
    gpu_initialize();
    cpuTime = 0;
    gpuTime = 0;
    for (iter = 0; iter < 2500; iter++)
    {
        t1 = get_time();
        for (inner = 0; inner < 1; inner++)   // inner bound is what I vary in the experiments below
            gpu_computations();
        t2 = get_time();
        cpu_computations();
        t3 = get_time();
        gpuTime += t2 - t1;
        cpuTime += t3 - t2;
    }
}
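(For completeness: get_time() is just a wall-clock timer with millisecond resolution. A minimal sketch of what I mean, assuming Linux and gettimeofday(); my actual timer is equivalent:)

#include <sys/time.h>

// wall-clock time in milliseconds (illustrative sketch, not my exact code)
double get_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}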
gpu_computations()
{
    // before - clear output buffer, copy input data to gpu
    cudaMemset(d_outputBuffer, 0, 12 MB);
    cudaMemcpy(d_inputBuffer, h_inputBuffer, 256 kB, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    // calculate
    gpu_kernel <<<64, 64>>> (arguments);
    cudaThreadSynchronize();

    // after - copy output data from gpu
    cudaMemcpy(h_outputBuffer, d_outputBuffer, 12 MB, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();
}
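(In case the host timer itself is suspect: I could also time the "calculate" phase with CUDA events instead of gettimeofday() + cudaThreadSynchronize(). A sketch of that, with made-up names ev0/ev1/kernelMs; everything else as above:)

cudaEvent_t ev0, ev1;
float kernelMs;

cudaEventCreate(&ev0);
cudaEventCreate(&ev1);

cudaEventRecord(ev0, 0);                    // mark start on the GPU timeline
gpu_kernel <<<64, 64>>> (arguments);
cudaEventRecord(ev1, 0);                    // mark end on the GPU timeline
cudaEventSynchronize(ev1);                  // wait until the kernel has finished

cudaEventElapsedTime(&kernelMs, ev0, ev1);  // elapsed time in milliseconds

cudaEventDestroy(ev0);
cudaEventDestroy(ev1);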
Each call to cpu_computations() takes around 20 ms, while each call to gpu_computations() takes around 13 ms. If I raise the inner loop bound so gpu_computations() is called multiple times per outer iteration (this does nothing useful; it just makes the GPU repeat the same work), the average time per call drops to around 7 ms:
gpu calls   total     average per call
    1        13 ms        13 ms
    2        14 ms         7 ms
    3        21 ms         7 ms
    4        28 ms         7 ms
    5        35 ms         7 ms
  100       700 ms         7 ms
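(Concretely, the only code change for these runs was the inner loop bound, e.g.:)

for (inner = 0; inner < 100; inner++)   // was: inner < 1
    gpu_computations();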
Why does one call to gpu_computations() take 13 ms, while two or more successive calls take N * 7 ms in total? I understand there is startup overhead with CUDA, but that should occur only once, at iter = 0. It shouldn’t show up in every outer iteration, and it apparently disappears entirely when gpu_computations() is called two or more times per iteration.
bandwidthTest shows 5.7-6.0 GB/s depending on the transfer direction, using pinned memory as my program does. This is the only part of the program using the GPU, no other programs are using the GPU, and no X server is running. Linux, GTX 280, CUDA 1.1, latest driver.
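(For reference, I ran the SDK binary roughly like this; flags quoted from memory, so treat the exact spelling as approximate:)

./bandwidthTest --memory=pinned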
I modified the code so I timed the before / calculate / after sections separately:
            1 call      2 or more calls
before      2.1 ms      N * 1.5 ms
calc        8.6 ms      N * 3.9 ms
after       2.2 ms      N * 1.8 ms
--------------------------------------
total      12.8 ms      N * 7.2 ms
There’s some extra overhead in the memory copies to/from the GPU, but those times are close to what I expect given the bandwidth that bandwidthTest reports. Most of the overhead shows up in the "calc" section around the GPU kernel call; that’s what I’m trying to understand.
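As a sanity check on the copy times (taking the 5.7 GB/s low end from bandwidthTest):

12 MB / 5.7 GB/s = 12 / (5.7 * 1024) s ≈ 2.1 ms

so the 2.2 ms "after" copy for a single call is right where the measured bandwidth predicts, and even the steady-state N * 1.8 ms is in the same ballpark. The 8.6 ms vs. N * 3.9 ms gap in the "calc" section is the part that bandwidth can’t explain.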