I am currently trying to figure out whether it is possible for the GPU to execute multiple kernels simultaneously.
I think the documentation is not very clear on this point, and I am really confused by the following experiment.
Have a look at this screenshot of cudaprof:
I have two kernels that are launched in two different ways (memcpy removed for clarity):
- (almost) at the same time in distinct streams

    // PUT DATA 1 and 2 ...
    kernelSpl2FusDevice <<< nBlocks, blockSize, 0, stream2 >>> ( b1_dev, b2_dev, b3_dev );
    kernelSpl1FusDevice <<< nBlocks, blockSize, 0, stream1 >>> ( a1_dev, a2_dev, a3_dev );
    // GET DATA 1 and 2 ...
    cudaStreamSynchronize ( stream1 );
    cudaStreamSynchronize ( stream2 );
- one after the other, with a synchronization in between

    // PUT DATA 1 ...
    kernelSpl1SepDevice <<< nBlocks, blockSize, 0, stream1 >>> ( a1_dev, a2_dev, a3_dev );
    // GET DATA 1 ...
    cudaStreamSynchronize ( stream1 );
    // PUT DATA 2 ...
    kernelSpl2SepDevice <<< nBlocks, blockSize, 0, stream2 >>> ( b1_dev, b2_dev, b3_dev );
    cudaStreamSynchronize ( stream1 );
    // GET DATA 2 ...
Now here is the profiling output:

    timestamp  method                         gputime  cputime  occupancy
    92809      _Z19kernelSpl2FusDevicePfS_S_  91.584   14       0.5
    92832      _Z19kernelSpl1FusDevicePfS_S_  36.128   3        0.667
    94266      _Z19kernelSpl1SepDevicePfS_S_  35.712   4        0.667
    95248      _Z19kernelSpl2SepDevicePfS_S_  91.904   4        0.5
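According to those numbers (and the plot above), you may think that kernelSpl1Fus() is running simultaneously with kernelSpl2Fus(): kernelSpl1Fus() starts at timestamp 92832, only 23 units after kernelSpl2Fus() (92809), while kernelSpl2Fus() has a gputime of 91.584. If both columns are in the same unit, the second kernel starts long before the first one ends.
But is that really the case? Are the timestamp, gputime and cputime values reliable?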
With my own timing (using either gettimeofday() or cutStartTimer()), I got the following numbers:

    fus (spl1 and spl2) : 220
    sep (spl1)          : 59
    sep (spl2)          : 115
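For comparison with the host timers above, here is a minimal sketch of how the "fus" case could be timed on the GPU side with CUDA events. It is only an illustration under assumptions: busyKernel and timeTwoStreamsMs are hypothetical names, busyKernel is a placeholder that stands in for the real kernels, and the problem size and iteration count are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): stands in for kernelSpl1FusDevice / kernelSpl2FusDevice.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Launches the placeholder kernel once in each of two streams and returns the
// elapsed GPU time in milliseconds, or -1.0f if any CUDA call fails.
float timeTwoStreamsMs(int n, int iters)
{
    float *a = nullptr, *b = nullptr;
    if (cudaMalloc(&a, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&b, n * sizeof(float)) != cudaSuccess) { cudaFree(a); return -1.0f; }
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int blockSize = 256;
    const int nBlocks = (n + blockSize - 1) / blockSize;

    cudaEventRecord(start, 0);                       // before both launches
    busyKernel<<<nBlocks, blockSize, 0, s1>>>(a, n, iters);
    busyKernel<<<nBlocks, blockSize, 0, s2>>>(b, n, iters);
    cudaStreamSynchronize(s1);                       // wait for both streams
    cudaStreamSynchronize(s2);
    cudaEventRecord(stop, 0);                        // after both kernels finished
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaEventElapsedTime(&ms, start, stop) == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return ok ? ms : -1.0f;
}

int main()
{
    float ms = timeTwoStreamsMs(1 << 20, 1000);
    if (ms < 0.0f)
        printf("CUDA error (no device?)\n");
    else
        printf("two kernels in two streams: %.3f ms\n", ms);
    return 0;
}
```

Timing each kernel alone in the same way and comparing the sum with this figure would give a GPU-side counterpart to the fus/sep numbers, without the host-side launch overhead that gettimeofday() also measures.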
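It turns out that those timings contradict the profiling data.
Running spl1 and spl2 together is actually slower than running them one after the other (220 > 59 + 115 = 174), so the two kernels seem to get in each other's way rather than overlap.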
So I am kind of lost here. I would like to know what really happens on the GPU.
Is it possible that the CUDA scheduler allocates some multiprocessors to kernel 1, and some others to kernel 2?
If not, how come the profiling tool produces that plot? How are the timestamp and gputime values collected?
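One thing that can be checked directly is whether the device reports support for concurrent kernel execution at all. The sketch below reads the concurrentKernels field of cudaDeviceProp; it assumes a toolkit recent enough for that field to exist, and supportsConcurrentKernels is a hypothetical helper name. A value of 1 only says the hardware can overlap kernels; as far as I understand, it does not guarantee that two given kernels will actually overlap.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if the device reports support for concurrent kernel execution,
// 0 if it does not, and -1 if the query fails (for example, no CUDA device).
int supportsConcurrentKernels(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.concurrentKernels ? 1 : 0;
}

int main()
{
    const int device = 0;  // assumption: the experiment runs on device 0
    int r = supportsConcurrentKernels(device);
    if (r < 0) {
        printf("could not query device %d\n", device);
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("%s: %d multiprocessors, concurrent kernels %s\n",
           prop.name, prop.multiProcessorCount,
           r ? "supported" : "not supported");
    return 0;
}
```

If this reports "not supported", the two launches in distinct streams would be expected to run one after the other on the GPU, whatever the profiler timestamps suggest.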