How can I use 2 GPUs and split the work between them?

Hi all,

I’m trying to do matrix multiplication with two GPUs, so that device 0 computes the upper half of matrix C and device 1 computes the lower half, using zero copy.

First, I don’t know: do we have to use one kernel or two kernels?

Second, how can I control which device gets the upper part and which gets the lower part?

Third, do I have to use cudaMemcpyAsync()?


What I did is like this:

//device 0//

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);

//device 1//

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);

Looking forward to some help.

Thanks

Hi,
Is it normal that I don’t see any “cudaSetDevice()” in your code?

You need to use streams and cudaSetDevice() to issue kernel calls on different devices.
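Roughly like this (an untested sketch; the per-device pointer and grid setup from your code is assumed, with `a_map0`/`a_map1` etc. standing for the mapped pointers obtained on each device):

```cuda
// Untested sketch: cudaSetDevice() switches the current device, and
// kernel launches are asynchronous, so both GPUs can run concurrently.
cudaSetDevice(0);
kernel<<<gridSize, blockSize>>>(a_map0, b_map0, c_map0);  // upper half of C

cudaSetDevice(1);
kernel<<<gridSize, blockSize>>>(a_map1, b_map1, c_map1);  // lower half of C

// Wait for both devices to finish before reading the results on the host.
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaSetDevice(1);
cudaDeviceSynchronize();
```

Note that driving two devices from one host thread like this requires CUDA 4.0 or later; on older toolkits you need one host thread per device.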

Isn’t it enough to use cudaSetDevice()?

Like this:

///Device 0////

cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

Then…

///Device 1////

cudaGetDeviceProperties(&prop, 1);
if (!prop.canMapHostMemory)
    exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);


Or just:

cudaSetDevice(0);
// do something
cudaSetDevice(1);
// do something

Then how can I assign each device to do something? I still don’t get the idea.
Please, could anyone give me the steps in order?

Thank you

Hello,

You can do something like this:

cudaSetDevice(0);

//kernel calls with pointers from device 0

cudaSetDevice(1);

//kernel calls with pointers from device 1

//collect the results

You might also find that ArrayFire makes multi-GPU usage much easier (handles the streams & synchronization automatically for you and automatically scales to the number of GPUs in the system). Details are here.
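If you want to stay with plain CUDA, here is a hedged, untested sketch of the upper/lower row split. It assumes row-major N×N matrices, host buffers allocated once with `cudaHostAllocPortable | cudaHostAllocMapped` so that both devices can map the same allocation, and a (hypothetical) kernel signature that takes the number of rows of C to compute plus N; `gridHalf` is an assumed grid sized for N/2 rows:

```cuda
// Untested sketch of the row split: device 0 computes rows [0, N/2)
// of C = A*B, device 1 computes rows [N/2, N). B is used in full by both.
int halfRows = N / 2;
size_t halfElems = (size_t)halfRows * N;   // elements in half of A or C

float *a0, *b0, *c0, *a1, *b1, *c1;

cudaSetDevice(0);                          // upper half
cudaHostGetDevicePointer((void**)&a0, a_h, 0);
cudaHostGetDevicePointer((void**)&b0, b_h, 0);
cudaHostGetDevicePointer((void**)&c0, c_h, 0);
kernel<<<gridHalf, blockSize>>>(a0, b0, c0, halfRows, N);

cudaSetDevice(1);                          // lower half
cudaHostGetDevicePointer((void**)&a1, a_h, 0);
cudaHostGetDevicePointer((void**)&b1, b_h, 0);
cudaHostGetDevicePointer((void**)&c1, c_h, 0);
kernel<<<gridHalf, blockSize>>>(a1 + halfElems, b1, c1 + halfElems, halfRows, N);
```

Same kernel, two launches: each device sees the whole of B, but its A and C pointers are offset by N/2 rows, so neither GPU writes the other’s half of C. The `halfRows`/`N` parameters are only there so the kernel can bound its row loop; adapt them to your actual kernel.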

Thanks for the reply, I still don’t know how to do it.

If it is zero copy, I don’t have to use cudaMalloc or cudaMemcpy; I just use

cudaHostAlloc and cudaHostGetDevicePointer. Then what should I do so that the upper half of C is computed on one device and the lower half on the other?


Any ordered list of steps would help me so much.

What about this: get your code working on one GPU, show us, and we’ll help you port it to multiple GPUs. Giving hints blindly isn’t very effective.
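As a target shape for that single-GPU version, here is a hedged, untested zero-copy skeleton (error checking omitted; `nBytes`, `gridSize`, `blockSize` and the kernel are assumed from your code):

```cuda
// Single-GPU zero-copy skeleton (sketch). The GPU reads and writes the
// pinned host buffers directly through the mapped pointers, so no
// cudaMalloc/cudaMemcpy is needed.
cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);     // set before any allocation on the device

float *a_h, *b_h, *c_h;
cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&b_h, nBytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&c_h, nBytes, cudaHostAllocMapped);

// ... fill a_h and b_h on the host here ...

float *a_map, *b_map, *c_map;
cudaHostGetDevicePointer((void**)&a_map, a_h, 0);
cudaHostGetDevicePointer((void**)&b_map, b_h, 0);
cudaHostGetDevicePointer((void**)&c_map, c_h, 0);

kernel<<<gridSize, blockSize>>>(a_map, b_map, c_map);
cudaDeviceSynchronize();                   // after this, results are visible in c_h
```

The cudaDeviceSynchronize() at the end matters with zero copy: the launch is asynchronous, so c_h is only safe to read once the device has finished.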