Hi all,
I'm missing something when I try to copy results from different streams into the same integer array. I'm processing a picture and doing a strip-division of the independent work across multiple GPUs. I use a different stream for each GPU and divide the work evenly among them.
On the async memcpy back to the host, I use the same array as the destination, changing the stream parameter in a loop. However, I only get the first GPU's result copied over the rest of the image. So instead of the full image, I get 1/n-th of the image (the first strip) repeated n times.
I am posting the minimal code below to make my point clear. Is there a solution to this problem?
cudaGetDeviceCount(&devices);
int *dp_temp;
for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    cudaStreamCreate(&stream[dev]);
    cudaMalloc((void**)&dp_temp, (XSIZE)*YSIZE*sizeof(int));
    //cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(dev);
    if (cudaStatus != cudaSuccess)
        fprintf(stderr, "Failed to set Device %d", dev);
    cudaDeviceSynchronize();
    dim3 gridBlock((XSIZE/3)/BLOCKX, (YSIZE)/BLOCKY);
    dim3 threadBlock(BLOCKX, BLOCKY);
    dc<<<gridBlock, threadBlock, 0, stream[dev]>>>(dp_temp + (dev*(XSIZE/3)*YSIZE), ..., dev);
    cudaMemcpyAsync(device_pixel + (dev*(XSIZE/3)*YSIZE), dp_temp + (dev*(XSIZE/3)*YSIZE),
                    (XSIZE/3)*YSIZE * sizeof(int), cudaMemcpyDeviceToHost, stream[dev]);
    cudaDeviceSynchronize();
}
}
Thanks for your consideration!
Reading this, a few things come to mind; permit me a few suggestions:
a) The way you have set this up, you might as well use only one GPU.
If I am not mistaken, your host pauses at every cudaDeviceSynchronize() inside the for loop, on each iteration; hence each device only starts its work when the previous device finishes, instead of all GPUs working concurrently.
Therefore, consider using multiple threads on the host side, each managing a GPU (launch, etc.), or try callbacks, so that you can launch all the work (for all GPUs) first and only then wait for GPU completion.
b) I believe you are overwriting your int* dp_temp on each for-loop iteration; hence you are essentially 'forgetting'/losing your device memory pointers.
Rather, use an array of int pointers and allocate into dp_temp[dev].
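A minimal sketch of that change (MAX_GPUS is an assumed upper bound, and I'm keeping the question's XSIZE/3 strip size; error checking omitted):

```cuda
#define MAX_GPUS 8                                   // assumed upper bound on device count
int *dp_temp[MAX_GPUS];                              // one device pointer per GPU
const size_t stripInts = (size_t)(XSIZE / 3) * YSIZE; // ints per strip

for (int dev = 0; dev < devices; dev++)
{
    cudaSetDevice(dev);
    // Each allocation lives on its own device; dp_temp[dev] is no longer
    // overwritten (and lost) on the next iteration.
    cudaMalloc((void**)&dp_temp[dev], stripInts * sizeof(int));
}
```

Note that each dp_temp[dev] only needs to hold one strip, not the whole image.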
Just check this: in addition to multiple threads on the host side and callbacks, I suppose this should work equally well:
for (no_of_gpus)
{
    cudaSetDevice();
    launch kernel;
}
for (no_of_gpus)
{
    cudaSetDevice();
    cudaDeviceSynchronize();
    memory copy;
}
This is coming to me in bits and pieces, so forgive the multiple posts
//cudaMalloc((void**)&dp_temp[dev], (XSIZE/3)*YSIZE * sizeof(int));
might have been the better option; I would certainly use it instead.
When copying memory from device to host, your source offset seems improper:
dp_temp+(dev*(XSIZE/3)*YSIZE) should perhaps be only dp_temp — with one allocation per device, each device's strip starts at the beginning of its own buffer, so only the host destination needs the per-strip offset.
And if you are not issuing multiple kernels per device, why are you using streams in the first place?
You only wish to issue one kernel per device, so I am confident that you can do this in the default stream.
If you do not split your kernel launches and subsequent memory copies into two separate for loops, consider a cudaDeviceSynchronize() between each kernel launch and the subsequent cudaMemcpyAsync().