How can I use 2 GPUs and split the work between them?

Hi all,

I’m trying to do matrix multiplication with two GPUs, so that device 0 computes the upper half of matrix C and device 1 computes the lower half, using zero copy.

First, I don’t know: do we have to use one kernel or two kernels?

Second, how can I control which device gets the upper part and which gets the lower part?

Third, do I have to use cudaMemcpyAsync()?


What I did is like this:

//device 0//

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);

//device 1//

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);

Looking forward to some help.

Thanks

Hi,
Is it normal that I don’t see any “cudaSetDevice()” in your code?

You need to use streams and cudaSetDevice() to issue kernel calls on different devices.
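Roughly like this (an untested sketch; the per-device pointer and grid setup from your code is assumed, with `a_map0`/`a_map1` etc. standing for the mapped pointers obtained on each device):

```cuda
// Untested sketch: cudaSetDevice() switches the current device, and
// kernel launches are asynchronous, so both GPUs can run concurrently.
cudaSetDevice(0);
kernel<<<gridSize, blockSize>>>(a_map0, b_map0, c_map0);  // upper half of C

cudaSetDevice(1);
kernel<<<gridSize, blockSize>>>(a_map1, b_map1, c_map1);  // lower half of C

// Wait for both devices to finish before reading the results on the host.
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaSetDevice(1);
cudaDeviceSynchronize();
```

Note that driving two devices from one host thread like this requires CUDA 4.0 or later; on older toolkits you need one host thread per device.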

Isn’t it enough to use cudaSetDevice()?

Like this:

///Device 0////

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

Then…

///Device 1////

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);


Or just:

cudaSetDevice(0);
// do something
cudaSetDevice(1);
// do something

Then how can I assign each device to do something? I still don’t get the idea.
Please, could anyone give me the steps in order?

Thank you

Hello,

You can do something like this:

cudaSetDevice(0);

//kernel calls with pointers from device 0

cudaSetDevice(1);

//kernel calls with pointers from device 1

//collect the results

You might also find that ArrayFire makes multi-GPU usage much easier (handles the streams & synchronization automatically for you and automatically scales to the number of GPUs in the system). Details are here.
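If you want to stay with plain CUDA, here is a hedged, untested sketch of the upper/lower row split. It assumes row-major N×N matrices, host buffers allocated once with `cudaHostAllocPortable | cudaHostAllocMapped` so that both devices can map the same allocation, and a (hypothetical) kernel signature that takes the number of rows of C to compute plus N; `gridHalf` is an assumed grid sized for N/2 rows:

```cuda
// Untested sketch of the row split: device 0 computes rows [0, N/2)
// of C = A*B, device 1 computes rows [N/2, N). B is used in full by both.
int halfRows = N / 2;
size_t halfElems = (size_t)halfRows * N;   // elements in half of A or C

float *a0, *b0, *c0, *a1, *b1, *c1;

cudaSetDevice(0);                          // upper half
cudaHostGetDevicePointer((void**)&a0, a_h, 0);
cudaHostGetDevicePointer((void**)&b0, b_h, 0);
cudaHostGetDevicePointer((void**)&c0, c_h, 0);
kernel<<<gridHalf, blockSize>>>(a0, b0, c0, halfRows, N);

cudaSetDevice(1);                          // lower half
cudaHostGetDevicePointer((void**)&a1, a_h, 0);
cudaHostGetDevicePointer((void**)&b1, b_h, 0);
cudaHostGetDevicePointer((void**)&c1, c_h, 0);
kernel<<<gridHalf, blockSize>>>(a1 + halfElems, b1, c1 + halfElems, halfRows, N);
```

Same kernel, two launches: each device sees the whole of B, but its A and C pointers are offset by N/2 rows, so neither GPU writes the other’s half of C. The `halfRows`/`N` parameters are only there so the kernel can bound its row loop; adapt them to your actual kernel.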

Thanks for the reply, I still don’t know how to do it.

If it is zero copy, I don’t have to use cudaMalloc or cudaMemcpy; I just use

cudaHostAlloc and cudaHostGetDevicePointer. Then what should I do so that the upper half of C is computed on one device and the lower half on the other?


Any ordered list of steps would help me so much.

What about this: get your code working on one GPU, show us, and we’ll help you port it to multiple GPUs. Giving hints blindly isn’t very effective.
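As a target shape for that single-GPU version, here is a hedged, untested zero-copy skeleton (error checking omitted; `nBytes`, `gridSize`, `blockSize` and the kernel are assumed from your code):

```cuda
// Single-GPU zero-copy skeleton (sketch). The GPU reads and writes the
// pinned host buffers directly through the mapped pointers, so no
// cudaMalloc/cudaMemcpy is needed.
cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);     // set before any allocation on the device

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

// ... fill a_h and b_h on the host here ...

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);
cudaDeviceSynchronize();                   // after this, results are visible in c_h
```

The cudaDeviceSynchronize() at the end matters with zero copy: the launch is asynchronous, so c_h is only safe to read once the device has finished.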