Strange behavior with overlap of transfer and compute

laughingrice · October 17, 2011, 8:51pm

I’m trying to get concurrent copy and execute to work consistently with the Quadro 2000m, the 280.26 driver and cuda 4.0 (the 285 drivers are much worse at the moment). I was wondering if anyone can tell me what I’m doing wrong, what I’m missing or why this is the expected behavior.

I’m doing, memcpyHtoD, kernel, memcpyDtoH on 4 streams

Running the code on stream 0 (no overlap) takes 6.6ms

Following code runs at 7.93ms and nsight shows no overlap

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

Following code shows (almost) full overlap and runs at 5ms

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

Following code shows write overleap (memcpyHtoD) but no read overlap (memcpyDtoH) and runs at 6.17ms

cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

	cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

	cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

	cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

	kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

	kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

	kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

	kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

	cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

	cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

	cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

	cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

The way I see it, they all should behave the same and the first version is what I would actually want to do in a real program

Thanks

njuffa · October 17, 2011, 9:51pm

The order in which the copies and the kernels get launched is crucial for optimal performance. The following document describes this for Fermi-class GPUs in the context of CUDA Fortran, and includes helpful measurements and timing diagrams:

The general principles described in the above document apply to CUDA C as well, so don’t worry if you do not understand all the specifics of the Fortran code, simply think of it as pseudo code :-)

laughingrice · October 18, 2011, 1:52pm

Went over this again and it seems that the real life solution is to take the first version and to tweak it a bit. The requirement is to upload the data for the next kernel before downloading the data for the current one, so the solution is

[code]

    cudaMemcpyAsync(ddata1, hdata1, BUFSZ, cudaMemcpyHostToDevice, streams[0]);

    kernel <<< grid, block, 0, streams[0] >>> (ddata1, ddata1);

cudaMemcpyAsync(ddata2, hdata2, BUFSZ, cudaMemcpyHostToDevice, streams[1]);

    cudaMemcpyAsync(hdata1, ddata1, BUFSZ, cudaMemcpyDeviceToHost, streams[0]);

kernel <<< grid, block, 0, streams[1] >>> (ddata2, ddata2);

cudaMemcpyAsync(ddata3, hdata3, BUFSZ, cudaMemcpyHostToDevice, streams[2]);

    cudaMemcpyAsync(hdata2, ddata2, BUFSZ, cudaMemcpyDeviceToHost, streams[1]);

kernel <<< grid, block, 0, streams[2] >>> (ddata3, ddata3);

cudaMemcpyAsync(ddata4, hdata4, BUFSZ, cudaMemcpyHostToDevice, streams[3]);

    cudaMemcpyAsync(hdata3, ddata3, BUFSZ, cudaMemcpyDeviceToHost, streams[2]);

kernel <<< grid, block, 0, streams[3] >>> (ddata4, ddata4);

    cudaMemcpyAsync(hdata4, ddata4, BUFSZ, cudaMemcpyDeviceToHost, streams[3]);

[/coda]

The next issue is that the profiler chokes on this code for some reason. It shows times only for the first 7 kernel calls saying that some profiler rows are dropped due to incorrect gpu timestamp.

The kernel itself only does MAD in a loop to waste time, so that doesn’t seem to be the issue

Is the profiler limited in regards to how many kernels it can run?

thanks

njuffa · October 18, 2011, 2:53pm

The profiler may have a limit on the number of kernels it can track but it should certainly be more than seven. The problem with the incorrect GPU time stamps probably has some other reason. If the issues persists, and you are using the CUDA 4.0 toolchain, I would suggest filing a bug with a self-contained repro case and detailed information about the platform you are running on. Thanks.

Sanjiv.Satoor · October 19, 2011, 8:14am

Visual profiler version 4.0 has a bug in case of an application using multiple streams due to which rows can be dropped. Use the Visual Profiler Patch for CUDA Toolkit 4.0 ver 4.0.051. Refer The Official NVIDIA Forums | NVIDIA. Note that this patch is currently available only for Linux.

Topic		Replies	Views
about streaming style sample code in Programming Guide ... why such a style? CUDA Programming and Performance	5	1468	January 23, 2009
streams not overlapping CUDA Programming and Performance	1	1584	May 23, 2011
Concurrent copy & execution problem Device to host memory copy is not overlapped with kernel exe CUDA Programming and Performance	1	1815	June 23, 2010
Bug when overlapping tranfert & data CUDA Programming and Performance	1	600	February 11, 2011
Weird behaviour of CUDA streams CUDA Programming and Performance	0	1909	June 17, 2010
cudaMemcpyAsync Question Overlap HostToDevice and DeviceToHost trasfers CUDA Programming and Performance	2	5675	April 2, 2009
Concurrent execution problem Try to understand how to achieve the data and execution concurrency CUDA Programming and Performance	4	1550	July 9, 2010
No Performance Improvement from Overlapping Kernel/Memcpy CUDA Programming and Performance	16	3249	July 14, 2010
Concurrent Kernel Execution / Memory Transfer We can't get it to work... CUDA Programming and Performance	5	4062	March 21, 2009
concurrent copy and execution CUDA Programming and Performance	0	1628	November 6, 2009

Strange behavior with overlap of transfer and compute

Related topics