A little help with Multi-GPU example please :) How do I pass data to each GPU?

Hi guys,

I am a Computer Engineering grad student, seeking some basic help with CUDA’s multi-GPU functionality.

The requirement is to write a CUDA app that must use both GPUs. I’ve downloaded & executed the multi-GPU example successfully. However, I’m not yet experienced enough with CUDA to understand how to pass a data structure to each GPU. I also need to get results back to the CPU side.

The following code snippets are from NVIDIA’s example. It starts with main() creating a thread per GPU:

int threadIds[MAX_CPU_THREAD];

printf("%d GPUs found\n", s_gpuCount);

CUTThread * threads = (CUTThread *)malloc(sizeof(CUTThread) * s_gpuCount);

// Start one thread for each device.

for(int i = 0; i < s_gpuCount; i++) {

        threadIds[i] = i;

        threads[i] = cutStartThread((CUT_THREADROUTINE)gpuThread, (void *)&threadIds[i]);


// Wait for all the threads to finish.

cutWaitForThreads(threads, s_gpuCount);


And a method that executes whatever kernel:

static CUT_THREADPROC gpuThread(int * device) {


       const int mem_size = NUM_BLOCKS * NUM_THREADS * sizeof(float) / s_gpuCount;

       float * idata;

        CUDA_SAFE_CALL(cudaMalloc( (void**) &idata, mem_size));

       float * odata;

        CUDA_SAFE_CALL(cudaMalloc( (void**) &odata, mem_size));

       // @@ Copy some values to the buffers.

       // Invoke kernel on this device.

        multigpu_kernel<<<NUM_BLOCKS / s_gpuCount, NUM_THREADS, NUM_THREADS*sizeof(float)>>>(idata, odata);

       // @@ Get the results back.



So for instance, if my CPU starts off with some arrays that need work done to them… how could I pass different arrays to different GPUs? The results must also end up on the CPU side eventually.

I would very much appreciate some guidance. Thank you for reading :)

  • Vash

Well, inside gpuThread, you can basically do:

if (*device == 0)

memcpy from host array1 to idata

if (*device == 1)

memcpy from host array2 to idata

and so on, but a cleaner and more generic way might be to split the array up given the amount of GPU’s and the deviceID in each thread. This is also what they do in the multiGPU example where they have an array of size n and compute a part of the array: in your case something like starting from (device(ceil(n/#gpu’s))) to ((device+1)(ceil(n/#gpu’s))). Note that this requires a check for indexing out of bounds and that the last GPU could have less work to do if n is not dividable by #gpu’s.

Hope this helps

Thanks for the reply, immo… I think you’re getting me in the right direction. I have a couple basic memory CUDA questions though:

Do the following two instructions strictly allocate video memory? :

float * idata;

CUDA_SAFE_CALL(cudaMalloc( (void**) &idata, mem_size));

And if I have the the following array of size 1000, in the CPU side:

float3 * CPU_array = new float3[1000];

for ( int i = 0; i < 1000; i++ ) {

     // fill in CPU_array with values


In your reply you mentioned memcpy to idata. In this case, what is the particular instruction that does a full memcpy from CPU_array to idata? And lastly, once the GPU has finished its processing on idata, are results sent back to the CPU via a similar memcpy? I’m assuming the example assumes such GPU results are to be stored in odata.

I think this is basically requesting some help with filling in what could be missing below the @@ comments.

Thanks again!

cudaMalloc allocates memory on GPU, and only on GPU.

cudaMemcpy(destination, source, size, cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost) do copying of input & output between CPU & GPU.

It’s not very difficult, though confusing.

The method that I generally use for multi-GPU splitting is to have the kernel wrapper discriminantly process data. I pass a pointer to the entire CPU array(s) to the wrapper, and two indeces: a start index, and the number of elements to process. This gives the wrapper a specific data range to process. The wrapper works the same way regardless of the data range, and doesn’t care about the number of GPUs in the system.


kernel_wrapper(float *idata, float *odata, size_t startIndex, size_t elements, int gpuNo)


  float* gpuIdata, gpuOdata;


cudaMalloc((void**) &gpuIdata, elements * sizeof(float));

  cudaMalloc((void**) &gpuIdata, elements * sizeof(float));

cudaMemcpy(gpuIdata, (void*)&idata[startIndex], elements*sizeof(float), cudaMemcpyHostToDevice));


cudaMemcpy((void*)&odata[startIndex], gpuOdata, elements*sizeof(float), cudaMemcpyDeviceToHost));



Of course, this may get far more complex, depending on how data is organized. If you want a real-life example, you can check my cuda project (source posted here: http://g-tech.homeserver.com/HPC.htm). I’ve implemented a more generic, but very similar scheme there.

I have different problem - with OpenGL but also with multi GPU. I want to use two GPUs separately - first for rendering, second for CUDA. Now I have sth like:

#pragma omp parallel

#pragma omp sections


#pragma omp section



			//simple CUDA computing block

			//GL instructions (looped)


#pragma omp section




			while(start==0)//waiting for pressing button



			calculateOnGPU...//data copied to GPU before start			




I’m using VBOs so there’s very small number of data transfered from main memory to first GPU. So I don’t understand why when I don’t use CUDA in second section, I have 300-400fps, but when I start CUDA - my application slows down and I have 30fps. And one important thing - data is prepared before start - I don’t use cudaMemcpy after. Maybe my idea is wrong and it’s not correct way to use two GPUs? SLI disabled of course, OpenMP enabled in project settings.

You’re assuming thet OpenGL runs on GPU0, which should generally be true, but is not guaranteed to be the case.

I runned it few times, I changed parameters from 1/0 to 0/1 - always works the same. Maybe sth don’t use only one GPU? Is possible that GL uses both cards (if SLI is disabled)? I have no idea what to do now - I’m thinking about using gpu_affinity (and trying to find sth useful), nvapi etc. but I don’t know it really can help.

Is there a way to synchronize the different pthreads and transfer data between the devices controlled by the various pthreads???