cudaMemcpyAsync copying back to same array from different streams!!

Hi all,
I am missing something when I try to copy results from different streams into the same integer array. I am processing a picture and dividing the independent work into strips across multiple GPUs, using a different stream for each GPU and splitting the work evenly between them.
On the cudaMemcpyAsync back to the host, I use the same array as the destination, changing the stream parameter in a loop. However, I only get the result of the first GPU copied over the rest of the image. That is, instead of the full image I get the first 1/nth segment repeated n times.
I am posting the minimal code to make my point clear. Is there a solution to this problem?

cudaGetDeviceCount(&devices);
int *dp_temp;
for (int dev = 0; dev < devices; dev++)
{
	cudaSetDevice(dev);
	cudaStreamCreate(&stream[dev]);
	cudaMalloc((void**)&dp_temp, XSIZE * YSIZE * sizeof(int));

	//cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));

	cudaError_t cudaStatus = cudaSetDevice(dev);
	if (cudaStatus != cudaSuccess)
		fprintf(stderr, "Failed to set Device %d", dev);

	cudaDeviceSynchronize();

	dim3 gridBlock((XSIZE/3)/BLOCKX, YSIZE/BLOCKY);
	dim3 threadBlock(BLOCKX, BLOCKY);
	dc<<<gridBlock, threadBlock, 0, stream[dev]>>>(dp_temp + (dev*(XSIZE/3)*YSIZE), ..., dev);

	cudaMemcpyAsync(device_pixel + (dev*(XSIZE/3)*YSIZE), dp_temp + (dev*(XSIZE/3)*YSIZE),
	                (XSIZE/3)*YSIZE * sizeof(int), cudaMemcpyDeviceToHost, stream[dev]);

	cudaDeviceSynchronize();
}

}

Thanks for your consideration!!

Reading this, a few things come to mind.

Permit me a few suggestions:

a) The way you have set it up, you might as well use only 1 GPU.
If I am not mistaken, your host is going to pause at the cudaDeviceSynchronize() inside the for loop on every iteration; hence, each device only starts its work when the previous device finishes, instead of all GPUs running concurrently.
Therefore, consider using multiple threads on the host side, each managing one GPU (launch, copies, etc.), or attempt to use callbacks, such that you can launch all work (for all GPUs) first, and only then wait for GPU completion.
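The host-thread idea can be sketched roughly as follows; this is only a sketch under my assumptions (a C++11 host compiler, and `launchOnDevice` is an illustrative name, not from your code):

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Illustrative helper: each host thread drives one GPU end to end.
static void launchOnDevice(int dev)
{
    cudaSetDevice(dev);          // device selection is per host thread
    // ... cudaMalloc, kernel launch, copy this GPU's strip back ...
    cudaDeviceSynchronize();     // blocks only this thread, not the others
}

int main()
{
    int devices = 0;
    cudaGetDeviceCount(&devices);

    std::vector<std::thread> workers;
    for (int dev = 0; dev < devices; dev++)
        workers.emplace_back(launchOnDevice, dev);  // all GPUs start together
    for (auto &t : workers)
        t.join();                                   // wait for all of them
    return 0;
}
```

Because cudaSetDevice is per host thread, each worker can talk to its own GPU without stepping on the others.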

b) I believe you are overwriting your int* dp_temp with each for loop iteration; hence, you are essentially 'forgetting'/losing your device memory pointers (and leaking the previously allocated device memory).
Rather use an array of int pointers, int* dp_temp[], indexed as dp_temp[dev].
Just check this.
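A minimal sketch of that idea (MAX_GPUS is an illustrative compile-time bound, not from your code):

```cuda
int *dp_temp[MAX_GPUS];   // one device pointer per GPU; nothing gets overwritten

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    // the allocation lands on the device selected above
    cudaMalloc((void**)&dp_temp[dev], (XSIZE/3) * YSIZE * sizeof(int));
}
```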

In addition to multiple threads on the host side, and callbacks, I suppose this should work equally well:

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    // launch returns immediately, so all GPUs start working
    launch kernel on dev;
}

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaDeviceSynchronize();   // wait for this GPU to finish
    copy this GPU's results back;
}

This is coming to me in bits and pieces, so forgive the multiple posts.

//cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));

might have been the better option; I would certainly use it instead.

When copying memory from device to host, your source offset seems improper:

dp_temp+(dev*(XSIZE/3)*YSIZE) should perhaps be just dp_temp (or dp_temp[dev] with the array-of-pointers version): each device's buffer holds only its own strip, so the source starts at offset 0.

And if you are not issuing multiple kernels per device, why are you using streams in the first place?
You only wish to issue one kernel per device, so I am confident that you can do this in the default stream.

If you do not split your kernel launches and the subsequent memory copies into two separate for loops, consider a cudaDeviceSynchronize() between each kernel launch and the subsequent cudaMemcpyAsync().
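Putting the points together, here is how I would restructure the loop; a hedged sketch only, keeping your names (dc, XSIZE, YSIZE, BLOCKX, BLOCKY, device_pixel, the "..." kernel arguments) and assuming an illustrative MAX_GPUS bound and 3 GPUs as in your /3 division:

```cuda
int *dp_temp[MAX_GPUS];                      // per-device pointers (point b)
const size_t stripInts = (size_t)(XSIZE/3) * YSIZE;

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaMalloc((void**)&dp_temp[dev], stripInts * sizeof(int));

    dim3 gridBlock((XSIZE/3)/BLOCKX, YSIZE/BLOCKY);
    dim3 threadBlock(BLOCKX, BLOCKY);
    // default stream is enough: one kernel per device; launch returns immediately
    dc<<<gridBlock, threadBlock>>>(dp_temp[dev], ..., dev);
}

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaDeviceSynchronize();                 // wait for this GPU's kernel only now
    // source offset is 0: dp_temp[dev] holds only this GPU's strip
    cudaMemcpy(device_pixel + dev * stripInts, dp_temp[dev],
               stripInts * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dp_temp[dev]);
}
```

Each destination offset into the host array is different per device, so nothing gets copied over the first segment again.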