Strange behavior with overlap of transfer and compute

I’m trying to get concurrent copy and execute to work consistently with the Quadro 2000m, the 280.26 driver and cuda 4.0 (the 285 drivers are much worse at the moment). I was wondering if anyone can tell me what I’m doing wrong, what I’m missing or why this is the expected behavior.

I’m doing, memcpyHtoD, kernel, memcpyDtoH on 4 streams

Running the code on stream 0 (no overlap) takes 6.6ms

Following code runs at 7.93ms and nsight shows no overlap

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

Following code shows (almost) full overlap and runs at 5ms

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

Following code shows write overleap (memcpyHtoD) but no read overlap (memcpyDtoH) and runs at 6.17ms

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

The way I see it, they all should behave the same and the first version is what I would actually want to do in a real program

Thanks

The order in which the copies and the kernels get launched is crucial for optimal performance. The following document describes this for Fermi-class GPUs in the context of CUDA Fortran, and includes helpful measurements and timing diagrams:

http://www.pgroup.com/lit/articles/insider/v3n1a4.htm

The general principles described in the above document apply to CUDA C as well, so don’t worry if you do not understand all the specifics of the Fortran code, simply think of it as pseudo code :-)

Went over this again and it seems that the real life solution is to take the first version and to tweak it a bit. The requirement is to upload the data for the next kernel before downloading the data for the current one, so the solution is

[code]

    cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

    kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

    cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

    cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

    cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

    cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

[/coda]

The next issue is that the profiler chokes on this code for some reason. It shows times only for the first 7 kernel calls saying that some profiler rows are dropped due to incorrect gpu timestamp.

The kernel itself only does MAD in a loop to waste time, so that doesn’t seem to be the issue

Is the profiler limited in regards to how many kernels it can run?

thanks

The profiler may have a limit on the number of kernels it can track but it should certainly be more than seven. The problem with the incorrect GPU time stamps probably has some other reason. If the issues persists, and you are using the CUDA 4.0 toolchain, I would suggest filing a bug with a self-contained repro case and detailed information about the platform you are running on. Thanks.

Visual profiler version 4.0 has a bug in case of an application using multiple streams due to which rows can be dropped. Use the Visual Profiler Patch for CUDA Toolkit 4.0 ver 4.0.051. Refer http://forums.nvidia.com/index.php?showtopic=210289. Note that this patch is currently available only for Linux.