Issue on parallelising memcpy

BrownZer · August 25, 2022, 10:12am

Dear experts,

I’m a beginner in CUDA programming.
I was following the to perform a parallelised host-to-device memcpy and I wrote the example code below.
However, when I ran an Nsight Systems trace, I found that the memcpys don’t seem to be parallelised, as shown in the attached picture.

May I ask what is missing in the example code, or whether this can be done by another approach?

Thanks in advance!
Brown

	void Routine()
	{
		cudaStream_t s1;
		cudaStream_t s2;

		CUDA_CHECK(cudaStreamCreate(&s1));
		CUDA_CHECK(cudaStreamCreate(&s2));

		double* host_1;
		double* host_2;

		CUDA_CHECK(cudaHostAlloc(&host_1, MYSIZE, cudaHostAllocDefault));
		CUDA_CHECK(cudaHostAlloc(&host_2, MYSIZE, cudaHostAllocDefault));

		double* device_1;
		double* device_2;

		CUDA_CHECK(cudaMallocAsync(&device_1, MYSIZE, s1));
		CUDA_CHECK(cudaMallocAsync(&device_2, MYSIZE, s2));

		CUDA_CHECK(cudaMemcpyAsync(device_1, host_1, MYSIZE, cudaMemcpyHostToDevice, s1));
		CUDA_CHECK(cudaMemcpyAsync(device_2, host_2, MYSIZE, cudaMemcpyHostToDevice, s2));

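		// wait for both copies to finish before releasing the buffers they use
		CUDA_CHECK(cudaStreamSynchronize(s1));
		CUDA_CHECK(cudaStreamSynchronize(s2));

		// clean up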
		CUDA_CHECK(cudaFreeHost(host_1));
		CUDA_CHECK(cudaFreeHost(host_2));
		CUDA_CHECK(cudaFreeAsync(device_1, s1));
		CUDA_CHECK(cudaFreeAsync(device_2, s2));
		CUDA_CHECK(cudaStreamDestroy(s1));
		CUDA_CHECK(cudaStreamDestroy(s2));
	}
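
The snippet above uses `CUDA_CHECK` and `MYSIZE` without showing them. A minimal sketch of what they could look like follows. Both definitions are assumptions for illustration only (the versions used in the post were not shown), and the 64 MiB size is an arbitrary choice.

```cuda
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical buffer size in bytes (64 MiB); the value used in the post is unknown.
#define MYSIZE (static_cast<size_t>(64) << 20)

// Hypothetical error-checking macro: on failure, print the file, line and
// CUDA error string of the failing call, then exit.
#define CUDA_CHECK(call)                                              \
	do {                                                              \
		cudaError_t err_ = (call);                                    \
		if (err_ != cudaSuccess) {                                    \
			std::fprintf(stderr, "CUDA error at %s:%d: %s\n",         \
			             __FILE__, __LINE__,                          \
			             cudaGetErrorString(err_));                   \
			std::exit(EXIT_FAILURE);                                  \
		}                                                             \
	} while (0)
```

With these two definitions placed above `Routine()`, the posted code should compile with `nvcc` on a toolkit that provides `cudaMallocAsync` (CUDA 11.2 or later).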

Robert_Crovella · August 25, 2022, 1:58pm

There isn’t any expectation that such operations would execute “in parallel”. The pipe that connects the host to the device has a particular finite bandwidth.

No, it can’t be done by other approaches. That is, there is nothing you can do to force the two operations to begin at the same time and run concurrently.
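
One way to see the bandwidth limit described above is to time the same host-to-device traffic issued two ways: as one copy per stream on two streams, and as two copies back-to-back on a single stream. The sketch below assumes a single GPU and pinned host buffers of an arbitrary 64 MiB each; `checkCuda`, `gbPerSec`, `timeCopies` and `kBytes` are names made up for this illustration. If the host-to-device link is the bottleneck, the two timings should come out roughly equal; no particular numbers are claimed here.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message if a CUDA runtime call failed.
static void checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Convert "bytes moved in ms milliseconds" to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1e9) / (static_cast<double>(ms) / 1e3);
}

// Issue two host-to-device copies, one on sA and one on sB, and return the
// wall-clock time in milliseconds until both have completed.
// Passing the same stream twice forces the copies to run back-to-back.
static float timeCopies(void* d1, void* d2, const void* h1, const void* h2,
                        size_t bytes, cudaStream_t sA, cudaStream_t sB) {
    checkCuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    auto t0 = std::chrono::steady_clock::now();
    checkCuda(cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, sA), "copy 1");
    checkCuda(cudaMemcpyAsync(d2, h2, bytes, cudaMemcpyHostToDevice, sB), "copy 2");
    checkCuda(cudaStreamSynchronize(sA), "sync A");
    checkCuda(cudaStreamSynchronize(sB), "sync B");
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main() {
    const size_t kBytes = 64u << 20;  // 64 MiB per buffer, an arbitrary choice
    void *h1, *h2, *d1, *d2;
    cudaStream_t s1, s2;

    // Pinned host memory, so the copies are truly asynchronous DMA transfers.
    // The buffer contents are left uninitialised; only the timing matters here.
    checkCuda(cudaMallocHost(&h1, kBytes), "cudaMallocHost");
    checkCuda(cudaMallocHost(&h2, kBytes), "cudaMallocHost");
    checkCuda(cudaMalloc(&d1, kBytes), "cudaMalloc");
    checkCuda(cudaMalloc(&d2, kBytes), "cudaMalloc");
    checkCuda(cudaStreamCreate(&s1), "cudaStreamCreate");
    checkCuda(cudaStreamCreate(&s2), "cudaStreamCreate");

    timeCopies(d1, d2, h1, h2, kBytes, s1, s2);  // warm-up run, result discarded

    float twoStreams = timeCopies(d1, d2, h1, h2, kBytes, s1, s2);
    float oneStream  = timeCopies(d1, d2, h1, h2, kBytes, s1, s1);

    std::printf("two streams: %.3f ms (%.2f GB/s)\n",
                twoStreams, gbPerSec(2 * kBytes, twoStreams));
    std::printf("one stream : %.3f ms (%.2f GB/s)\n",
                oneStream, gbPerSec(2 * kBytes, oneStream));

    checkCuda(cudaFree(d1), "cudaFree");
    checkCuda(cudaFree(d2), "cudaFree");
    checkCuda(cudaFreeHost(h1), "cudaFreeHost");
    checkCuda(cudaFreeHost(h2), "cudaFreeHost");
    checkCuda(cudaStreamDestroy(s1), "cudaStreamDestroy");
    checkCuda(cudaStreamDestroy(s2), "cudaStreamDestroy");
    return 0;
}
```

Streams remain useful for other kinds of overlap: on GPUs that support it, a host-to-device copy can overlap with a device-to-host copy or with kernel execution. Two copies in the same direction still share the same link.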

BrownZer · August 26, 2022, 1:15am

Thanks a lot for the reply! It clearly answers my question :)

