My GPU is a Tesla K40.
It has one kernel engine and two copy engines.
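(The copy-engine count can be confirmed by querying the device properties, for example:)

cudaDeviceProp prop;
checkCudaErrors(cudaGetDeviceProperties(&prop, 0));
// asyncEngineCount reports the number of copy engines; it is 2 on the K40
cout << "asyncEngineCount is:\t" << prop.asyncEngineCount << endl;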
I ran into a strange phenomenon when testing the performance of cudaMemcpyAsync, as follows:
threads = dim3(512, 1);
blocks = dim3(n / threads.x, 1);
// time for device-to-device memcpy in one stream and kernel in the other stream
checkCudaErrors(cudaEventRecord(start_event,0));
for(int i = 0 ; i < 5 ; ++i) {
    checkCudaErrors(cudaMemcpyAsync(deviceMemoryDest_2, deviceMemorySrc_1, nbytes, cudaMemcpyDeviceToDevice, streams[0]));
}
for(int i = 0 ; i < 5 ; ++i) {
    scaleVector<<<blocks, threads, 0, streams[1]>>>(deviceComputeMemory, deviceFactorMemory, num_iterations);
}
checkCudaErrors(cudaEventRecord(stop_event,0));
checkCudaErrors(cudaEventSynchronize(stop_event));
checkCudaErrors(cudaEventElapsedTime(&time, start_event, stop_event));
cout << "time is:\t" << time << endl;
It seems that the kernel and the memcpy in the code above cannot overlap. But when I change the code as follows, they do overlap:
threads = dim3(512, 1);
blocks = dim3(n / threads.x, 1);
// time for device-to-device memcpy in one stream and kernel in the other stream
checkCudaErrors(cudaEventRecord(start_event,0));
for(int i = 0 ; i < nreps ; ++i) {
    checkCudaErrors(cudaMemcpyAsync(deviceMemoryDest_2, deviceMemorySrc_1, nbytes, cudaMemcpyDeviceToDevice, streams[0]));
    scaleVector<<<blocks, threads, 0, streams[1]>>>(deviceComputeMemory, deviceFactorMemory, num_iterations);
}
checkCudaErrors(cudaEventRecord(stop_event,0));
checkCudaErrors(cudaEventSynchronize(stop_event));
checkCudaErrors(cudaEventElapsedTime(&time, start_event, stop_event));
cout << "time is:\t" << time << endl;
I find this very strange and would like to know what happens in the kernel engine and the copy engines when I use cudaMemcpyDeviceToDevice.
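For completeness, the rest of my test harness looks roughly like this (a simplified sketch; the buffer size, the value of num_iterations, and the body of scaleVector are assumptions rather than my exact code):

// kernel: each thread repeatedly scales one element to keep the kernel busy (assumed body)
__global__ void scaleVector(float *data, const float *factor, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[idx];
    for (int i = 0; i < iters; ++i)
        v *= factor[idx];
    data[idx] = v;
}

// host-side setup (names match the snippets above; sizes are assumed)
const int n = 1 << 22;
const size_t nbytes = n * sizeof(float);
const int num_iterations = 256;
float *deviceMemorySrc_1, *deviceMemoryDest_2, *deviceComputeMemory, *deviceFactorMemory;
checkCudaErrors(cudaMalloc(&deviceMemorySrc_1, nbytes));
checkCudaErrors(cudaMalloc(&deviceMemoryDest_2, nbytes));
checkCudaErrors(cudaMalloc(&deviceComputeMemory, nbytes));
checkCudaErrors(cudaMalloc(&deviceFactorMemory, nbytes));

cudaStream_t streams[2];
for (int i = 0; i < 2; ++i)
    checkCudaErrors(cudaStreamCreate(&streams[i]));

cudaEvent_t start_event, stop_event;
checkCudaErrors(cudaEventCreate(&start_event));
checkCudaErrors(cudaEventCreate(&stop_event));
float time = 0.0f;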