Simultaneous execution of multiple kernels


I am currently trying to figure out whether it is possible for the GPU to execute multiple kernels simultaneously.

I think the documentation is not very clear on this point, and I am really confused by the following experiment:

Have a look at this screenshot of cudaprof:

I have two kernels that are launched (memcpy calls removed for clarity):

  • (almost) at the same time, in distinct streams:

	  // PUT DATA 1 and 2 ...

	  kernelSpl2FusDevice <<< nBlocks, blockSize, 0, stream2 >>> ( b1_dev, b2_dev, b3_dev );
	  kernelSpl1FusDevice <<< nBlocks, blockSize, 0, stream1 >>> ( a1_dev, a2_dev, a3_dev );

	  // GET DATA 1 and 2 ...

	  cudaStreamSynchronize ( stream1 );
	  cudaStreamSynchronize ( stream2 );
  • separately:

	  // PUT DATA 1 ...

	  kernelSpl1SepDevice <<< nBlocks, blockSize, 0, stream1 >>> ( a1_dev, a2_dev, a3_dev );

	  // GET DATA 1 ...

	  cudaStreamSynchronize ( stream1 );

	  // PUT DATA 2 ...

	  kernelSpl2SepDevice <<< nBlocks, blockSize, 0, stream2 >>> ( b1_dev, b2_dev, b3_dev );

	  cudaStreamSynchronize ( stream2 );

	  // GET DATA 2 ...

Now here is the profiling output :

timestamp   method                           gputime   cputime   occupancy
92809       _Z19kernelSpl2FusDevicePfS_S_    91.584    14        0.5
92832       _Z19kernelSpl1FusDevicePfS_S_    36.128    3         0.667
94266       _Z19kernelSpl1SepDevicePfS_S_    35.712    4         0.667
95248       _Z19kernelSpl2SepDevicePfS_S_    91.904    4         0.5

According to those numbers (and the plot above), you might think that kernelSpl1Fus() is running simultaneously with kernelSpl2Fus(): compare the timestamps against the GPU timings.

But is that really the case? Are the timestamp, gputime, and cputime values reliable?

With my own timing (using either gettimeofday() or cutStartTimer()), I got the following numbers:

fus (spl1 and spl2): 220
sep (spl1): 59
sep (spl2): 115

Those timings contradict the profiling data.

We can see that spl1 and spl2 conflict with each other when executed together (220 > 59 + 115).

So I am kind of lost here. I would like to know what really happens on the GPU.

Is it possible that the CUDA scheduler allocates some multiprocessors to kernel 1 and others to kernel 2?

If not, how come the profiling tool produces that plot? How are the timestamp and gputime values collected?

Best Regards,


At the moment, a CUDA device can only run one kernel at a time. You can, however, asynchronously queue up several kernels (24 or 32) to run in sequence, which is what you see here. The profiler appears to record the time you submitted the kernel for execution, not the actual time it started.
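One way to see this for yourself is to time each kernel with CUDA events rather than the profiler's timestamps: an event is timestamped on the GPU when the stream actually reaches it, not when the host submits work. A minimal sketch, assuming a placeholder kernel and sizes (dummyKernel and n are not from the original code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for kernelSpl1FusDevice / kernelSpl2FusDevice.
__global__ void dummyKernel(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = a[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a_dev;
    cudaMalloc(&a_dev, n * sizeof(float));

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    // Events are timestamped on the device when reached in the stream,
    // so they measure actual execution time, not submission time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream1);
    dummyKernel<<<n / 256, 256, 0, stream1>>>(a_dev);
    cudaEventRecord(stop, stream1);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time on device: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream1);
    cudaFree(a_dev);
    return 0;
}
```

If you bracket each of your four kernels this way, the per-kernel device times should add up to the wall-clock total you measured with gettimeofday(), confirming serialization.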

Ok, thank you very much for your answer.

  • Are there any plans for such simultaneous execution? Where can I find some information?

  • Can the timestamp issue be considered a bug in cudaprof?

No, NVIDIA has not announced any plan to support simultaneous execution of kernels. The currently preferred method to run two kernels at once is to install two cards. (Note that you then have to run two threads in your program, one for each card.)
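The two-card setup mentioned above can be sketched roughly as follows, with one pthread per device; in the CUDA runtime of this era each host thread gets its own context, so each thread must call cudaSetDevice() before doing any CUDA work. The kernel and sizes are placeholders, not the poster's code:

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

__global__ void workKernel(float *a) {            // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

static void *worker(void *arg) {
    int device = *(int *)arg;
    cudaSetDevice(device);                        // bind this thread to one card

    const int n = 1 << 20;
    float *a_dev;
    cudaMalloc(&a_dev, n * sizeof(float));
    workKernel<<<n / 256, 256>>>(a_dev);
    cudaThreadSynchronize();                      // wait for this card's kernel
    cudaFree(a_dev);
    return NULL;
}

int main() {
    int ids[2] = {0, 1};
    pthread_t threads[2];

    // Each thread drives a different GPU, so the two
    // kernels can genuinely run at the same time.
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);

    printf("both kernels finished\n");
    return 0;
}
```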

Maybe. I think we would need some comment on this from an NVIDIA developer to know what the intended behavior is.

There might be. It would bring some very nice efficiency improvements, especially once CUDA goes mass-market. (E.g., a kernel wouldn't need to be big enough to occupy the whole GPU, and a compute-heavy kernel could overlap nicely with a bandwidth-heavy one.) But NVIDIA has not stated any plans (and it rarely does). It's not coming anytime soon, at least.