Overlapping CPU and GPU code.


Assume I have N streams.

I launch N kernels using those N streams and they write to device memory.

// d_Counter[N] is an array of pointers.
for(int i =0; i <N; i++)
Kernel_1<< dimGrid, dimBlock, 0, stream[i]>>(d_Counter[N]);

I read the written values using the same N streams using CudaMemcpyAsync using the N streams and copy the results to the host.

//h_Counter is page locked memory.
for(int i =0; i <N; i++)
cudaMemcpyAsync((void*)(h_Counter + i), d_Counter[N], sizeof(unsigned int), cudaMemcpyDeviceToHost, stream[i]));

Based on the results I launch N kernels using the N streams.

for(int i =0; i <N; i++)
Kernel_2<< hCounter[i], dimBlock2, 0, stream[i]>>();

Then I call a cpu function.


I have two questions.

  1. Am I right in assuming that Kernel_2 will be launched only after hCounter[i] has the proper value that has been copied from device memory as streams are supposed to by synchronous? I am getting confused as I am using cudaMemcpyAsync. Do I need to call CudaStreamSynchronize(stream[i])?

  2. If I want to overlap some CPU code with GPU code that is run cpuFUNC(); will it overlap only with Kernel2 or with Kernel_1, CudaMemcpyAsync and Kernel_2 ?

Thank you very much.

  1. As it is currently written, there is no guarantee that hCounter[i] here:
Kernel_2<< hCounter[i], dimBlock2, 0, stream[i]>>();

contains the value that would be deposited by this:

cudaMemcpyAsync((void*)(h_Counter + i), d_Counter[N], sizeof(unsigned int), cudaMemcpyDeviceToHost, stream[i]));

If you want hCounter[i] to reflect the value provided by the cudaMemcpyAsync operation, you would need a stream synchronize or other synchronizing item after the cudaMemcpyAsync, and before the kernel launch.

  1. Your second question is not entirely clear. As written, cpuFUNC() code (assuming it is host code) could overlap in time with GPU operations occurring as early as Kernel_1 launches, or possibly even GPU activity launched prior to that. You have no GPU synchronizing activity in the code you have shown. It is all asynchronously launched, with respect to host code.

Thank you txbob for your answer.

Regarding Question 1.

I am finding good results with my program and it seems to be working but I cannot verify it exactly because in debug mode gpu operations are always synchronous, but the results that I am getting seems to be synchronous and if h_Counter had garbage value it would have crashed.

Regarding Question 2.

I am confused and thinking that the host code will overlap only with Kernel_2 because I am using streams and gpu operations on a same stream are supposed to be synchronized so it seems that Kernel_2 in stream “i” will launch only after Kernel_1 in stream “i” finishes thus the cpu will wait to launch Kernel_2 in stream “i”. So it seems that the cpu code will only overlap with Kernel_2.

Thank you very much. But as regards my current problem I just found that cpuFunc does not take much time so overlapping will not be an issue with my current work.

Thanks again. :)

I’m not sure what that means.

cudaMemcpyAsync will exhibit synchronous behavior if the underlying host allocation is not a pinned allocation. Since you haven’t shown any real code, I’m just pointing that out as a possibility. In any event, your posted code pattern demonstrates a vulnerability as I described. Here’s a fully worked example:

$ cat t1092.cu
#include <stdio.h>
#define TDELAY 1000000000ULL
__global__  void Kernel_1(int *counter){

  unsigned long long dt = clock64();
  while (clock64() < dt + TDELAY);
  if (threadIdx.x == 0)  atomicAdd(counter, 1);

__global__ void Kernel_2(int *counter){
  if (threadIdx.x == 0) atomicAdd(counter, 1);

int main(){

  int *d_Counter, *d_Counter2, *h_Counter;
  cudaMalloc(&d_Counter, sizeof(int));
  cudaMalloc(&d_Counter2, sizeof(int));
  cudaHostAlloc(&h_Counter, sizeof(int), cudaHostAllocDefault);
  cudaStream_t st;
  cudaMemset(d_Counter, 0, sizeof(int));
  cudaMemset(d_Counter2, 0, sizeof(int));
  *h_Counter = 1;
  Kernel_1<<<2, 1, 0, st>>>(d_Counter);
  cudaMemcpyAsync(h_Counter, d_Counter, sizeof(int), cudaMemcpyDeviceToHost, st);
#ifdef FIX
  Kernel_2<<<*h_Counter, 1, 0, st>>>(d_Counter2);
  int blocks = 0;
  cudaMemcpy(&blocks, d_Counter2, sizeof(int), cudaMemcpyDeviceToHost);
  printf("Kernel_2 ran with %d blocks\n", blocks);
  return 0;

$ nvcc -o t1092 t1092.cu
$ ./t1092
Kernel_2 ran with 1 blocks
$ nvcc -DFIX t1092.cu -o t1092
$ ./t1092
Kernel_2 ran with 2 blocks

Without the additional synchronization inserted with the FIX compilation, the second kernel launch picks up an old “stale” version of *h_Counter.

Thanks a lot txbob.

I used pinned memory.


If you look at page 11 of the presentation it says operations within a stream do not overlap. For safety i put the cudaDeviceSynchronize but when I removed that I did not see any change in the visual result of my program. I think I will do more debugging and surely post my results.

If you look at the CUDA programming guide:


“The ordering of kernel launches from the device runtime follows CUDA Stream ordering semantics. Within a thread block, all kernel launches into the same stream are executed in-order.”

C.2.1.4. Streams and Events
“CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams.”

I understand if I need to use the value of h_Counter in the cpu only after the launch of Kernel 1 and the cudaMemcpyAsync then if I do not use a cudaDeviceSynchronize after the cudaMemcpyAsync it will pick a stale value , but I am launching another kernel with the value of h_Counter and that launch is in the same stream and since kernel launches in the same stream are supposed to be synchronous I am thinking that the cpu will wait for the cudaMemcpyAsync to finish before launching Kernel_2.

I was wrong in saying that in debug mode asynchronous operations are synchronous. It is actually optional to make them synchronous:

From the CUDA programming guide :

"Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked."

The ordering of kernel launches within a stream will indeed follow stream issue ordering.

This doesn’t mean that your proposed approach is safe, and I believe my sample demonstrates that it is not.

The asynchronous nature of a kernel launch means that it is placed into a queue prior to the point at which it begins executing.

The launch characteristics (eg. blocks, threads) are determined when the kernel is placed into the launch queue. It is not released from the queue until all prior CUDA activity issued to the same stream is complete, but the characteristics of the launch have already been determined and recorded in the queue.

The execution of a kernel is a separate activity from the queueing of a kernel for launch.

so this statement is incorrect:

" I am thinking that the cpu will wait for the cudaMemcpyAsync to finish before launching Kernel_2. "

Insofar as this characteristic is concerned, the CPU does not wait.

A simple thought experiment will prove to you that the suggested “wait” cannot occur.

Kernel launches are asynchronous. If a CPU thread had to “wait” to launch a kernel, it could not continue on with activities after the kernel launch. Therefore there is a launch queueing mechanism, so that the CPU can record the necessity for the kernel launch, and then continue on.

The process of queuing the kernel for launch includes recording the parameters in the launch configuration i.e. <<<…>>>

Changes to the value of those parameters after the queuing of the launch, will have no effect on the parameters recorded in the launch queue.

Again, my example demonstrates exactly this.

Thanks a lot for this awesome explanation txbob !!! I appreciate it very much. I will put back the synchronization in my codes.

Thank you!