From looking at the latest and greatest discussions regarding multi-GPU and multi-threading, I'm trying to figure out whether it's possible to create multiple contexts running in parallel from the same thread. I have a serialized version which works perfectly, but before separating it further into an additional two threads and dealing with syncing them, I wanted to get an idea of whether this is feasible.
Something like this:
…
cuDeviceGet(&dev,0);
cuCtxCreate(&ctx1, 0, dev);
… do cuda stuff on gpu0
cuDeviceGet(&dev,1);
cuCtxCreate(&ctx2, 0, dev);
… do cuda stuff on gpu1
cuCtxDestroy(ctx1);
cuCtxDestroy(ctx2);
Of course I might have omitted some ctx steps.
Will something like this run concurrently? Will it run at all?
With CUDA 2.0 that's no longer true. You can now select which context is current to the calling thread with cuCtxPushCurrent()/cuCtxPopCurrent() (see section 4.5.3.2 in the Programming Manual 2.0b2).
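To make that concrete, here's a minimal sketch of the push/pop pattern from a single thread (error checking omitted; the "… do cuda stuff" lines stand in for your launches):

#include <cuda.h>

int main(void)
{
    CUdevice dev0, dev1;
    CUcontext ctx0, ctx1;

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);

    /* cuCtxCreate makes the new context current to this thread,
       so pop each one right after creating it. */
    cuCtxCreate(&ctx0, 0, dev0);
    cuCtxPopCurrent(NULL);
    cuCtxCreate(&ctx1, 0, dev1);
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx0);
    /* … do cuda stuff on gpu0 … */
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx1);
    /* … do cuda stuff on gpu1 … */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
    return 0;
}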
Though it is possible to manage more than one device from one thread, I wouldn't suggest doing so. Your code can easily get tangled up in all this context stack management. It's much more straightforward to have two threads, each working with its own device, IMO.
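For what it's worth, a rough sketch of that one-thread-per-device layout, here using pthreads (just an assumption on my part; any threading API would do), error checking omitted:

#include <cuda.h>
#include <pthread.h>

/* One worker per GPU: each thread creates and keeps its own context,
   so there is no push/pop juggling at all. */
static void *worker(void *arg)
{
    int ordinal = (int)(size_t)arg;
    CUdevice dev;
    CUcontext ctx;

    cuDeviceGet(&dev, ordinal);
    cuCtxCreate(&ctx, 0, dev);   /* stays current to this thread */
    /* … do cuda stuff on this device … */
    cuCtxDestroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    cuInit(0);                   /* once per process, before the threads */
    pthread_create(&t0, NULL, worker, (void *)(size_t)0);
    pthread_create(&t1, NULL, worker, (void *)(size_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}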
Thanks for that recommendation. The problem is that my computational task varies depending on the range of items I tell it to work on. To get the most utilization out of my multiple GPUs, I'm thinking of making more, smaller kernel calls and launching each one on whichever GPU is free.
Can you tell me what might go wrong? I'm just trying to get an idea of what people are doing or thinking in these situations.
I can imagine the following being better than a multi-threaded solution, but what do you guys and gals think? At this point it doesn't seem to be a question of whether it's possible, but rather which approach is recommended/better (it depends on the specific application, I know, I know… but you have an idea of what I have to deal with).
…
set up a context on each device, probably an array of them
This should work; however, there may still be some problems. One is GPU under-utilization if you have different GPUs installed, i.e. if you have a loop like this:
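Roughly along these lines; the ctx[] array and the blocking synchronize on each pass are assumptions:

for (i = 0; i < numTasks; i += numGpus) {
    /* dispatch one task to each GPU; kernel launches are async */
    for (g = 0; g < numGpus; g++) {
        cuCtxPushCurrent(ctx[g]);
        /* … launch the kernel for task i+g … */
        cuCtxPopCurrent(NULL);
    }
    /* then wait for all of them before the next round */
    for (g = 0; g < numGpus; g++) {
        cuCtxPushCurrent(ctx[g]);
        cuCtxSynchronize();   /* each round waits for the slowest GPU */
        cuCtxPopCurrent(NULL);
    }
}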
Now, if the device for ctx[0] is faster than the device for ctx[1] (e.g. an 8800 GTX and an 8600 GT), you'll still be limited by the performance of the 8600 GT. In fact, you'll get 2x 8600 GT and not 8600 GT + 8800 GTX.
This problem can be solved by using the Stream and/or Event API to check whether a device is busy, but I'm not really good at this ;-)
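Something along these lines might work (a rough sketch, untested: create a CUevent with cuEventCreate() in each context, record it with cuEventRecord() right after the last launch there, then poll it; gpu_is_free is a made-up helper name):

/* Returns non-zero once all work recorded before the event is done. */
int gpu_is_free(CUcontext ctx, CUevent done)
{
    CUresult r;
    cuCtxPushCurrent(ctx);       /* the event belongs to this context */
    r = cuEventQuery(done);      /* CUDA_SUCCESS when finished,
                                    CUDA_ERROR_NOT_READY while busy */
    cuCtxPopCurrent(NULL);
    return r == CUDA_SUCCESS;
}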
Great point. The whole reasoning for this was to eliminate GPU under-utilization. Cutting my problem into smaller kernels that still utilize the full GPU and basically running them asynchronously without synchronization is what I had in mind. Not so much iterating through the number of GPUs; rather, having a smart function that could tell me if nothing is running on a GPU in my 'pool' and return that one, or check another one, or wait.
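I'm picturing something like this (pick_free_gpu is a made-up name, building on the gpu_is_free sketch above):

/* Scan the pool for an idle GPU; -1 means everything is still busy,
   so the caller can sleep briefly and retry. */
int pick_free_gpu(CUcontext ctx[], CUevent done[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (gpu_is_free(ctx[i], done[i]))
            return i;
    return -1;
}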
Thanks so much for the insight. I’m barely getting into this context managing thing myself and hopefully it’ll play out well =P