Concurrent copy & execution problem: device-to-host memory copy is not overlapped with kernel execution

I found some strange behavior using streams. The host->device memory copy is overlapped with the kernel execution, but the device->host copy waits for all kernel runs to finish.

The following image shows the timing:

As you can see from the profiler output, the streams are used correctly and all host memory is pinned:

The SDK's simpleStreams example is also affected by the problem; I get the following results:

I use a GTX 480 in an MSI P7N motherboard (nForce 780), CUDA toolkit 3.0, driver 257.21. Everything was run through VNC (hopefully this is not the reason…).

Code:

cudaStream_t stream[3];
for (int i = 0; i < 3; ++i)
    cudaStreamCreate(&stream[i]);

// host->device: per-stream output-address counters (4 bytes each)
for (int i = 0; i < 3; ++i)
    cudaMemcpyAsync(outAddrGPU + i, outAddr + i,
                    4, cudaMemcpyHostToDevice, stream[i]);

// host->device: per-stream input data blocks
for (int i = 0; i < 3; ++i)
    cudaMemcpyAsync(dataGPU + i*1024*512*8, dataPtr + i*1024*512*8,
                    (1024*512*8 + 32)*sizeof(unsigned int), cudaMemcpyHostToDevice, stream[i]);

// one kernel launch per stream
for (int i = 0; i < 3; ++i)
    ref_bin_sm<<<dimGrid, dimThrd, 0, stream[i]>>>(dataGPU + i*1024*512*8, resGPU[i], outAddrGPU + i, 0, 0x3f, 0x0);

// device->host: per-stream result blocks
for (int i = 0; i < 3; ++i)
    cudaMemcpyAsync(resPtr + i*(128*1024*1024/8), resGPU[i],
                    (80*1024*1024/8)*sizeof(unsigned long long), cudaMemcpyDeviceToHost, stream[i]);
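Independent of the profiler output (which I get through VNC), the total time can also be checked with events; a minimal timing sketch wrapped around the code above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
/* ...the async copies and kernel launches above... */
cudaEventRecord(stop, 0);   // stream 0 waits for all other streams here
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
// with real overlap this should be clearly less than the sum of the
// individual copy and kernel times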

I have modified the host code so that the host->device copies are issued on stream 0 and the kernel runs plus device->host copies go into three other streams. Moreover, I put the whole sequence into a for loop, so that each stream runs a kernel and a memcopy three times (roughly as in the sketch below). This way concurrent execution works more or less as I would have imagined.
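Roughly, the restructured code looks like this (the constants are taken from the sizes above; the outAddr copy is omitted for brevity, and the exact synchronization in my code may differ — I simply block on the host here):

const size_t CHUNK    = 1024*512*8;        // input elements per stream
const size_t RESSTRIDE = 128*1024*1024/8;  // result stride per stream (elements)
const size_t RESWORDS  = 80*1024*1024/8;   // result words copied per stream

cudaStream_t copyStream, stream[3];
cudaStreamCreate(&copyStream);
for (int i = 0; i < 3; ++i)
    cudaStreamCreate(&stream[i]);

for (int iter = 0; iter < 3; ++iter) {
    // all host->device copies on the dedicated copy stream
    for (int i = 0; i < 3; ++i)
        cudaMemcpyAsync(dataGPU + i*CHUNK, dataPtr + i*CHUNK,
                        (CHUNK + 32)*sizeof(unsigned int), cudaMemcpyHostToDevice, copyStream);
    // make sure the inputs have arrived before the kernels start
    cudaStreamSynchronize(copyStream);
    // kernel + device->host copy per stream
    for (int i = 0; i < 3; ++i) {
        ref_bin_sm<<<dimGrid, dimThrd, 0, stream[i]>>>(dataGPU + i*CHUNK, resGPU[i], outAddrGPU + i, 0, 0x3f, 0x0);
        cudaMemcpyAsync(resPtr + i*RESSTRIDE, resGPU[i],
                        RESWORDS*sizeof(unsigned long long), cudaMemcpyDeviceToHost, stream[i]);
    }
}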

Still, the first memcopy waits for all kernel runs to finish, but at least the later kernels and copies are overlapped. I am still not really satisfied with the result, as I thought streams were command queues where a command executes as soon as it can (that is, once all previous commands in the same stream have completed and the required resource is free). But it seems to me that streams are more limited than that.

I have a more general problem with cudaMemcpyAsync. In my case, the amount of data generated is not known before the kernel runs, so a global memory variable holds the number of outputs. The idea is to read this variable first and then do a memcopy based on its value. That is:
(1) run the kernel
(2) memcopy the number of results (RES)
(3) memcopy RES output words
Unfortunately this does not work with cudaMemcpyAsync, as the number of bytes to copy is passed to the memcopy by value, not by reference. So when the second memcopy is queued into the stream this value is still zero; it only gets its real value once the first memcopy has executed. Passing this parameter by reference would solve this kind of problem, but I cannot really see a way to implement that with the current cudaMemcpyAsync function.
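To make the failure concrete (my_kernel and countGPU are simplified stand-ins for my actual code):

unsigned int resCount = 0;          // pinned host variable

// (1) the kernel writes its outputs and stores their count in countGPU
my_kernel<<<dimGrid, dimThrd, 0, stream[0]>>>(dataGPU, resGPU[0], countGPU);
// (2) copy the result count back (RES)
cudaMemcpyAsync(&resCount, countGPU, sizeof(unsigned int),
                cudaMemcpyDeviceToHost, stream[0]);
// (3) copy RES output words -- BROKEN: the byte count is evaluated right
// now, while resCount is still 0, so a zero-byte copy is queued
cudaMemcpyAsync(resPtr, resGPU[0], resCount * sizeof(unsigned long long),
                cudaMemcpyDeviceToHost, stream[0]);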
If an NVIDIA employee reads this, maybe they can answer: is it possible to have a memcopy that passes all of its parameters by reference?
If anyone has an idea how to do a hack, I would really appreciate it.
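For completeness, the only fallback I see is to synchronize between steps (2) and (3), which of course stalls the host and kills exactly the overlap I am after:

cudaStreamSynchronize(stream[0]);   // block until resCount has really arrived
// now the size argument is valid
cudaMemcpyAsync(resPtr, resGPU[0], resCount * sizeof(unsigned long long),
                cudaMemcpyDeviceToHost, stream[0]);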