A little help with Multi-GPU example please :) How do I pass data to each GPU?

Vashren · June 23, 2008, 8:17pm

Hi guys,

I am a Computer Engineering grad student, seeking some basic help with CUDA’s multi-GPU functionality.

The requirement is to write a CUDA app that must use both GPUs. I’ve downloaded & executed the multi-GPU example successfully. However, I’m not yet experienced enough with CUDA to understand how to pass a data structure to each GPU. I also need to get results back to the CPU side.

The following code snippets are from NVIDIA’s example. It starts with main() creating a thread per GPU:

int threadIds[MAX_CPU_THREAD];

printf("%d GPUs found\n", s_gpuCount);

CUTThread * threads = (CUTThread *)malloc(sizeof(CUTThread) * s_gpuCount);

// Start one thread for each device.
for(int i = 0; i < s_gpuCount; i++) {
        threadIds[i] = i;
        threads[i] = cutStartThread((CUT_THREADROUTINE)gpuThread, (void *)&threadIds[i]);
}

// Wait for all the threads to finish.
cutWaitForThreads(threads, s_gpuCount);

free(threads);

And a method that executes whatever kernel:

static CUT_THREADPROC gpuThread(int * device) {
        CUDA_SAFE_CALL(cudaSetDevice(*device));

        const int mem_size = NUM_BLOCKS * NUM_THREADS * sizeof(float) / s_gpuCount;

        float * idata;
        CUDA_SAFE_CALL(cudaMalloc( (void**) &idata, mem_size));

        float * odata;
        CUDA_SAFE_CALL(cudaMalloc( (void**) &odata, mem_size));

        // @@ Copy some values to the buffers.

        // Invoke kernel on this device.
        multigpu_kernel<<<NUM_BLOCKS / s_gpuCount, NUM_THREADS, NUM_THREADS*sizeof(float)>>>(idata, odata);

        // @@ Get the results back.

        CUT_THREADEND;
}

So for instance, if my CPU starts off with some arrays that need work done to them… how could I pass different arrays to different GPUs? The results must also end up on the CPU side eventually.

I would very much appreciate some guidance. Thank you for reading :)

Vash

immo · June 24, 2008, 7:24am


Well, inside gpuThread, you can basically do:

if (*device == 0)
        memcpy from host array1 to idata

if (*device == 1)
        memcpy from host array2 to idata

and so on, but a cleaner and more generic way might be to split the array up based on the number of GPUs and the device ID in each thread. This is also what they do in the multiGPU example, where they have an array of size n and each GPU computes a part of it: in your case, something like starting from device * ceil(n / gpus) up to (device + 1) * ceil(n / gpus). Note that this requires a check for indexing out of bounds, and that the last GPU could have less work to do if n is not divisible by the number of GPUs.
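
A minimal sketch of that split, written as host-side C++ that should also compile unchanged under nvcc. The names Chunk and chunkFor are hypothetical and do not come from the SDK sample; the sketch only assumes that every GPU gets a contiguous piece of ceil(n / gpus) elements, clamped to the end of the array.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [begin, end) of array elements assigned to one GPU.
// Hypothetical helper type, not part of the multiGPU SDK sample.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split n elements across gpuCount devices in pieces of ceil(n / gpuCount).
// Both bounds are clamped to n, so trailing devices may get fewer
// (or even zero) elements when n is not divisible by gpuCount.
inline Chunk chunkFor(std::size_t n, int gpuCount, int device) {
    assert(gpuCount > 0 && device >= 0 && device < gpuCount);
    const std::size_t count = static_cast<std::size_t>(gpuCount);
    const std::size_t perGpu = (n + count - 1) / count;  // integer ceil(n / gpuCount)
    std::size_t begin = static_cast<std::size_t>(device) * perGpu;
    std::size_t end = begin + perGpu;
    if (begin > n) begin = n;  // out-of-bounds check: this device has no work
    if (end > n) end = n;      // the last busy device takes the remainder
    return Chunk{begin, end};
}
```

Inside gpuThread, each thread could call chunkFor(n, s_gpuCount, *device) and then copy only the elements from begin up to (but not including) end to its own GPU.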

Hope this helps

Vashren · June 24, 2008, 7:00pm

Thanks for the reply, immo… I think you’re getting me in the right direction. I have a couple of basic CUDA memory questions though:

Do the following two instructions strictly allocate video memory? :

float * idata;

CUDA_SAFE_CALL(cudaMalloc( (void**) &idata, mem_size));

And if I have the following array of size 1000 on the CPU side:

float3 * CPU_array = new float3[1000];

for ( int i = 0; i < 1000; i++ ) {
     // fill in CPU_array with values
}

In your reply you mentioned memcpy to idata. In this case, what is the particular instruction that does a full memcpy from CPU_array to idata? And lastly, once the GPU has finished its processing on idata, are results sent back to the CPU via a similar memcpy? I’m assuming the example assumes such GPU results are to be stored in odata.

I think this is basically requesting some help with filling in what could be missing below the @@ comments.

Thanks again!

E.D_Riedijk · June 24, 2008, 7:52pm

cudaMalloc allocates memory on GPU, and only on GPU.

cudaMemcpy(destination, source, size, kind), with kind set to cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost, copies the input and output between CPU and GPU. The size is given in bytes.
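
As a rough sketch of what could go under the two @@ comments in the example, the code below copies a host array to one device, runs a kernel, and copies the result back. Everything except the CUDA runtime calls is an assumption made for illustration: scaleKernel and processOnDevice are hypothetical names, the kernel simply doubles each value, and plain float is used instead of float3 to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: doubles each input element.
__global__ void scaleKernel(const float *in, float *out, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) out[i] = 2.0f * in[i];
}

// Copies count floats to the given device, runs the kernel there, and
// copies the results back to the host. Returns the first CUDA error seen.
cudaError_t processOnDevice(int device, const float *hostIn, float *hostOut, int count) {
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) return err;
    if (count <= 0) return cudaSuccess;  // nothing to do for this device

    const size_t bytes = static_cast<size_t>(count) * sizeof(float);
    float *idata = NULL;
    float *odata = NULL;
    err = cudaMalloc((void **)&idata, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&odata, bytes);

    // "Copy some values to the buffers": host -> device.
    if (err == cudaSuccess)
        err = cudaMemcpy(idata, hostIn, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;
        const int blocks = (count + threads - 1) / threads;
        scaleKernel<<<blocks, threads>>>(idata, odata, count);
        err = cudaGetLastError();  // reports launch failures
    }

    // "Get the results back": device -> host. This blocking copy is
    // ordered after the kernel launched above on the default stream.
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, odata, bytes, cudaMemcpyDeviceToHost);

    cudaFree(idata);  // cudaFree on a null pointer does nothing
    cudaFree(odata);
    return err;
}

int main() {
    const int n = 1000;
    float *in = new float[n];
    float *out = new float[n];
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    cudaError_t err = processOnDevice(0, in, out, n);
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    else
        printf("out[10] = %f\n", out[10]);

    delete[] in;
    delete[] out;
    return 0;
}
```

In the multi-GPU example, each gpuThread would make the same sequence of calls with its own device number and its own part of the host arrays.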

Mr_Nuke · March 11, 2009, 10:01am

It’s not very difficult, though confusing.

The method that I generally use for multi-GPU splitting is to have the kernel wrapper process data selectively. I pass a pointer to the entire CPU array(s) to the wrapper, plus two values: a start index and the number of elements to process. This gives the wrapper a specific data range to process. The wrapper works the same way regardless of the data range, and doesn’t care about the number of GPUs in the system.

[codebox]

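void kernel_wrapper(float *idata, float *odata, size_t startIndex, size_t elements, int gpuNo)
{
  // Both must be pointers; "float* a, b;" would declare b as a plain float.
  float *gpuIdata, *gpuOdata;

  cudaSetDevice(gpuNo);

  cudaMalloc((void**) &gpuIdata, elements * sizeof(float));
  cudaMalloc((void**) &gpuOdata, elements * sizeof(float));

  // Copy only this GPU's slice of the host input to the device.
  cudaMemcpy(gpuIdata, &idata[startIndex], elements * sizeof(float), cudaMemcpyHostToDevice);

  // The launch configuration was lost from the original post;
  // gridSize, blockSize and params are placeholders.
  Kernel<<<gridSize, blockSize>>>(params);

  // Copy the results back into the matching slice of the host output.
  cudaMemcpy(&odata[startIndex], gpuOdata, elements * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(gpuIdata);
  cudaFree(gpuOdata);
}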

[/codebox]

Of course, this may get far more complex, depending on how data is organized. If you want a real-life example, you can check my cuda project (source posted here: http://g-tech.homeserver.com/HPC.htm). I’ve implemented a more generic, but very similar scheme there.

_OskaR · January 25, 2010, 7:48pm

I have a different problem, with OpenGL but also with multi-GPU. I want to use two GPUs separately: the first for rendering, the second for CUDA. Right now I have something like:

#pragma omp parallel
#pragma omp sections
	{
#pragma omp section
		{
			cutilSafeCall(cudaSetDevice(0));
			//simple CUDA computing block
			//GL instructions (looped)
		}
#pragma omp section
		{
			cutilSafeCall(cudaSetDevice(1));
			cudaMemcpy...
			while(start==0)//waiting for pressing button
			{
			}
			calculateOnGPU...//data copied to GPU before start
			//CUDA
		}
	}

I’m using VBOs, so only a very small amount of data is transferred from main memory to the first GPU. So I don’t understand why, when I don’t use CUDA in the second section, I get 300-400 fps, but when I start CUDA, my application slows down to 30 fps. One important thing: the data is prepared before the start, and I don’t use cudaMemcpy afterwards. Maybe my idea is wrong and this is not the correct way to use two GPUs? SLI is disabled of course, and OpenMP is enabled in the project settings.

Mr_Nuke · January 25, 2010, 8:36pm

You’re assuming that OpenGL runs on GPU0, which should generally be true, but is not guaranteed to be the case.

_OskaR · January 25, 2010, 8:55pm

I ran it a few times and swapped the device parameters from 1/0 to 0/1; it always behaves the same. Maybe something isn’t using only one GPU? Is it possible that GL uses both cards (even with SLI disabled)? I have no idea what to do now. I’m thinking about using gpu_affinity (and trying to find something useful), nvapi, etc., but I don’t know if it can really help.

timgr · March 4, 2012, 1:34pm

Is there a way to synchronize the different pthreads and transfer data between the devices controlled by the various pthreads?
