cudaMemcpyAsync copying back to same array from different streams!!

Hi all,
I am missing something when I try to copy results from different streams into the same integer array. I am processing a picture and dividing the independent work into strips across multiple GPUs, using a different stream for each GPU and splitting the work evenly between them.
On the cudaMemcpyAsync back to the host, I use the same array as the destination, changing the stream parameter in a loop. However, I only get the result of the first GPU copied over the rest of the image. That is, instead of the full image I get the first 1/nth segment repeated n times.
I am posting the minimal code to make my point clear. Is there a solution to this problem?

cudaGetDeviceCount(&devices);
int *dp_temp;
for (int dev = 0; dev < devices; dev++)
{
	cudaSetDevice(dev);
	cudaStreamCreate(&stream[dev]);
	cudaMalloc((void**)&dp_temp, XSIZE * YSIZE * sizeof(int));

	//cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));

	cudaError_t cudaStatus = cudaSetDevice(dev);
	if (cudaStatus != cudaSuccess)
		fprintf(stderr, "Failed to set Device %d", dev);

	cudaDeviceSynchronize();

	dim3 gridBlock((XSIZE/3)/BLOCKX, YSIZE/BLOCKY);
	dim3 threadBlock(BLOCKX, BLOCKY);
	dc<<<gridBlock, threadBlock, 0, stream[dev]>>>(dp_temp + (dev*(XSIZE/3)*YSIZE), ..., dev);

	cudaMemcpyAsync(device_pixel + (dev*(XSIZE/3)*YSIZE), dp_temp + (dev*(XSIZE/3)*YSIZE),
	                (XSIZE/3)*YSIZE * sizeof(int), cudaMemcpyDeviceToHost, stream[dev]);

	cudaDeviceSynchronize();
}

}

Thanks for your consideration!!

Reading this, a few things come to mind.

Permit me a few suggestions:

a) The way you have set it up, you might as well use only 1 GPU.
If I am not mistaken, your host is going to pause at the cudaDeviceSynchronize() inside the for loop on every iteration; hence, each device only starts its work when the previous device finishes, instead of all GPUs running concurrently.
Therefore, consider using multiple threads on the host side, each managing one GPU (launch, copies, etc.), or attempt to use callbacks, such that you can launch all work (for all GPUs) first, and only then wait for GPU completion.
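The host-thread idea can be sketched roughly as follows; this is only a sketch under my assumptions (a C++11 host compiler, and `launchOnDevice` is an illustrative name, not from your code):

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Illustrative helper: each host thread drives one GPU end to end.
static void launchOnDevice(int dev)
{
    cudaSetDevice(dev);          // device selection is per host thread
    // ... cudaMalloc, kernel launch, copy this GPU's strip back ...
    cudaDeviceSynchronize();     // blocks only this thread, not the others
}

int main()
{
    int devices = 0;
    cudaGetDeviceCount(&devices);

    std::vector<std::thread> workers;
    for (int dev = 0; dev < devices; dev++)
        workers.emplace_back(launchOnDevice, dev);  // all GPUs start together
    for (auto &t : workers)
        t.join();                                   // wait for all of them
    return 0;
}
```

Because cudaSetDevice is per host thread, each worker can talk to its own GPU without stepping on the others.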

b) I believe you are overwriting your int* dp_temp with each for loop iteration; hence, you are essentially 'forgetting'/losing your device memory pointers (and leaking the previously allocated device memory).
Rather use an array of int pointers, int* dp_temp[], indexed as dp_temp[dev].
Just check this.
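A minimal sketch of that idea (MAX_GPUS is an illustrative compile-time bound, not from your code):

```cuda
int *dp_temp[MAX_GPUS];   // one device pointer per GPU; nothing gets overwritten

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    // the allocation lands on the device selected above
    cudaMalloc((void**)&dp_temp[dev], (XSIZE/3) * YSIZE * sizeof(int));
}
```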

In addition to multiple threads on the host side, and callbacks, I suppose this should work equally well:

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    // launch returns immediately, so all GPUs start working
    launch kernel on dev;
}

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaDeviceSynchronize();   // wait for this GPU to finish
    copy this GPU's results back;
}

This is coming to me in bits and pieces, so forgive the multiple posts.

//cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));

might have been the better option; I would certainly use it instead.

When copying memory from device to host, your source offset seems improper:

dp_temp+(dev*(XSIZE/3)*YSIZE) should perhaps be just dp_temp (or dp_temp[dev] with the array-of-pointers version): each device's buffer holds only its own strip, so the source starts at offset 0.

And if you are not issuing multiple kernels per device, why are you using streams in the first place?
You only wish to issue one kernel per device, so I am confident that you can do this in the default stream.

If you do not split your kernel launches and the subsequent memory copies into two separate for loops, consider a cudaDeviceSynchronize() between each kernel launch and the subsequent cudaMemcpyAsync().
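Putting the points together, here is how I would restructure the loop; a hedged sketch only, keeping your names (dc, XSIZE, YSIZE, BLOCKX, BLOCKY, device_pixel, the "..." kernel arguments) and assuming an illustrative MAX_GPUS bound and 3 GPUs as in your /3 division:

```cuda
int *dp_temp[MAX_GPUS];                      // per-device pointers (point b)
const size_t stripInts = (size_t)(XSIZE/3) * YSIZE;

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaMalloc((void**)&dp_temp[dev], stripInts * sizeof(int));

    dim3 gridBlock((XSIZE/3)/BLOCKX, YSIZE/BLOCKY);
    dim3 threadBlock(BLOCKX, BLOCKY);
    // default stream is enough: one kernel per device; launch returns immediately
    dc<<<gridBlock, threadBlock>>>(dp_temp[dev], ..., dev);
}

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaDeviceSynchronize();                 // wait for this GPU's kernel only now
    // source offset is 0: dp_temp[dev] holds only this GPU's strip
    cudaMemcpy(device_pixel + dev * stripInts, dp_temp[dev],
               stripInts * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dp_temp[dev]);
}
```

Each destination offset into the host array is different per device, so nothing gets copied over the first segment again.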