Multi-GPU - Some questions

I would like to use multiple GPUs. My questions:

  1. Do I have to use C++? And what about “cudaSetDevice”?
  2. Can I simulate devices in the debug-mode?

Thanks in advance!

  1. Not really…you can write your kernels in C, then call them from any language that has a wrapper (e.g. Python via PyCUDA). This applies to any number of devices.

  2. AFAIK, debug-emulation mode only supports a single device at this time.

profquail, thanks for your answer!

But in the “simpleMultiGPU” example, I found things like “solverThread”. Where can I find documentation about it?

Take a look at MrAnderson42’s GPUWorker class (search the forum, and go to the last page in the thread for the most recent link), as that should give you a good starting point for working with multiple GPUs.

  1. You can use C for sure.

  2. I think so. There are some multi-GPU sample programs in the CUDA SDK package.

profquail, thanks for your reply, but MrAnderson42’s links seem to be no longer available?

Axida, the problem is that it seems that you need special techniques in order to be able to implement multi-GPU. For instance in the multi-GPU example you can find:

static CUT_THREADPROC solverThread(TGPUplan *plan){

What is exactly “solverThread” in this case, what is “TGPUplan”, etc.?

Unfortunately, I can’t always access my CUDA computer, where I could check the CUDA source code. Are these multi-GPU techniques not documented anywhere?

Multi-GPU should not be as hard as it seems. You first need to fully grasp how to manage “simple” CPU threads (without any GPU work in them). Once you understand how to create and manage threads, pass data back and forth, wait for events, and so on, using any CPU threading model (pthreads, Windows threads, …), you can start writing multi-GPU code.

The idea is that each CPU host thread will attach to a different GPU device - using cudaSetDevice with a unique device id for each CPU thread.

All data that you want to allocate and manipulate on the GPU should be created in the context of the CPU thread that is attached to its appropriate GPU device.

If you have two GPUs:

  1. Start two “simple” CPU host threads; each thread gets a unique ID.

  2. On each CPU thread do what you want with the GPU ( cudaMalloc,… )

  3. Run the GPU kernel from both CPU threads.

  4. Copy result back from the GPU to the appropriate CPU thread.

Specifically, “solverThread” is the CPU host thread function, and TGPUplan is just a per-thread data structure holding the required data (for example, the ID of the GPU device that the current CPU thread works with).
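To make the four steps concrete, here is a minimal host-side sketch of the one-thread-per-GPU pattern. It is not the SDK's code: `GpuPlan`, `bindDevice` and `runOnAllDevices` are hypothetical names, `std::thread` is used in place of the cutil thread helpers, and `bindDevice` is a stand-in for `cudaSetDevice` so that the sketch builds without the CUDA toolkit. A CPU sum takes the place of the actual allocation, kernel launch and copy-back.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Per-thread plan, loosely modelled on TGPUplan: everything one host thread needs.
struct GpuPlan {
    int device;                 // GPU this host thread is bound to
    std::vector<float> input;   // this thread's share of the data
    float result;               // filled in by the worker
};

// Hypothetical stand-in for cudaSetDevice(device). The device selection is
// per host thread, which the thread_local variable mimics.
thread_local int boundDevice = -1;
void bindDevice(int device) { boundDevice = device; }

// Worker body: plays the role of solverThread in the SDK sample.
void solverThread(GpuPlan *plan) {
    bindDevice(plan->device);   // once, before any other GPU work in this thread
    // Real code would cudaMalloc, cudaMemcpy to the device, launch the kernel
    // and cudaMemcpy the result back here. The CPU sum below stands in for that.
    float sum = 0.0f;
    for (float x : plan->input) sum += x;
    plan->result = sum;
}

// Starts one worker per plan, waits for all of them, and combines the results.
float runOnAllDevices(std::vector<GpuPlan> &plans) {
    assert(!plans.empty());
    std::vector<std::thread> threads;
    for (GpuPlan &plan : plans) threads.emplace_back(solverThread, &plan);
    for (std::thread &t : threads) t.join();   // wait for every worker
    float total = 0.0f;
    for (const GpuPlan &plan : plans) total += plan.result;
    return total;
}
```

Each worker only touches its own plan, so no locking is needed; the main thread reads the results only after every `join()` has returned.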

Hope that helps…


Thanks eyal!

Unfortunately, I still can’t access CUDA. So maybe you could correct me if I am wrong:

  1. I write my “cu” file as usual.
  2. I put the launch routine “<<<…>>>” into a “thread”, like “solverThread” in the multi-GPU example.
  3. I call it with “cutStartThread”.
  4. Then I wait for the results with “cutWaitForThreads”.
  5. And then I analyze the results?

Also, is this slower than single-GPU?
And is there any documentation for “cutil”?


Basically this is the flow (of course you should not use the cutil code, as it’s not production-level :) )

Anyway, at least in my code, the GPUs all run the same code but operate on different input data sets, and therefore return different outputs.

That is why the same CPU code is used for all GPU devices.
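One common way to give each GPU a different input data set is to cut one large array into contiguous slices, one per device. The sketch below shows only that bookkeeping; `Slice` and `partition` are hypothetical names, not part of CUDA or the SDK, and how the data is really divided depends on the problem.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of elements handled by one device.
struct Slice {
    std::size_t begin;
    std::size_t end;
};

// Splits n elements into deviceCount contiguous slices whose sizes
// differ by at most one element.
std::vector<Slice> partition(std::size_t n, int deviceCount) {
    assert(deviceCount > 0);
    const std::size_t count = static_cast<std::size_t>(deviceCount);
    const std::size_t base = n / count;
    const std::size_t extra = n % count;   // the first `extra` slices get one more element
    std::vector<Slice> slices;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < count; ++d) {
        const std::size_t len = base + (d < extra ? 1 : 0);
        slices.push_back({begin, begin + len});
        begin += len;
    }
    return slices;
}
```

Each host thread would then copy only its own slice to its device and write its output into the matching range of the result array.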

The only difference is that each CPU thread has to know which GPU it works on. Here’s some pseudo code:

// This is the main thread code...

int iDeviceCount = GetDeviceCount();   // How many GPUs you have in the system.

for ( int iThread = 0; iThread < iDeviceCount; iThread++ )
{
    // Assign a different GPU device to each CPU thread before starting it -
    // the real code is a bit more complicated, but this is the logic...
    pThread->SetDeviceId( iThread );

#ifdef LINUX
    pthread_create( ...., WorkerThread, (void*)this );   // Where WorkerThread will call the kernel...
#else
    CWinThread *worker_thread = AfxBeginThread( WorkerThread, this );
#endif
}

WaitForAllThreadsToBeOver();   // Wait for all the threads....

// This is the actual WorkerThread (or solverThread in the SDK)

UINT GMultiThreadHandler::WorkerThread( LPVOID data )
{
    ThreadData *pThread = (ThreadData *)data;   // This should be the TGPUplan in the SDK.

    // Call cudaSetDevice ONCE per CPU host thread....
    cudaSetDevice( pThread->GetDeviceId() );

    // Do as much GPU calculation as you want on the current thread
    MyKernel<<< ....>>> ( .... );

    return 0;
}



Hope this helps some more.

As for the speed question: if everything works well, 2 GPUs should obviously run faster than one. This depends on the code quality ( :) ) and on the type of problem you are trying to solve. If most of the time is “wasted” on PCI overhead, 2 GPUs will probably not run twice as fast as one…
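As a rough back-of-the-envelope illustration of that last point, the sketch below uses a deliberately simple model that is my own assumption, not a measurement: the transfer time stays the same no matter how many GPUs are used (as if all copies shared one bus and did not overlap with computation), while the compute time divides evenly among the GPUs. `estimatedSpeedup` is a hypothetical helper name.

```cpp
#include <cassert>

// Estimated speedup over a single GPU, assuming transfers are not sped up
// by adding GPUs and the compute time divides evenly among them.
double estimatedSpeedup(double transferSeconds, double computeSeconds, int gpuCount) {
    assert(gpuCount > 0 && transferSeconds >= 0.0 && computeSeconds > 0.0);
    const double single = transferSeconds + computeSeconds;
    const double multi = transferSeconds + computeSeconds / gpuCount;
    return single / multi;
}
```

Under this model, a job with 1 s of transfers and 1 s of computation takes 2 s on one GPU and 1.5 s on two, a speedup of about 1.33 rather than 2. With no transfer time at all, the same formula gives exactly 2. Real systems can do better than the model suggests, for example when each GPU has its own bus bandwidth or when copies overlap with computation.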


The new link is on the last page of the thread:…st&p=597186

Thanks to all for your answers!

Unfortunately, I had no luck: the linker couldn’t find an object file, or something like that.

But I was able to create multiple threads with “process.h”! (Though not CUDA-related.)

Therefore my question: Do I need CUDA’s “multithreading.h” for multi-GPU, or can I use “process.h”?

Since I can only rarely access my CUDA computer and I don’t have multiple GPUs, I would be really happy if you could help me with this problem!