Multi-GPU with a single thread and driver API?

From looking at the latest discussions regarding multi-GPU and multi-thread use, I'm trying to figure out whether it's possible to create multiple contexts running in parallel from the same thread. I have a serialized version which works perfectly, but before separating it further into an additional two threads and dealing with syncing them, I wanted to get an idea of whether this is feasible.

Something like this:

cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx1, 0, dev);

… do CUDA stuff on GPU 0

cuDeviceGet(&dev, 1);
cuCtxCreate(&ctx2, 0, dev);

… do CUDA stuff on GPU 1

cuCtxDestroy(ctx1);
cuCtxDestroy(ctx2);

Of course I might have omitted some ctx steps.

Will something like this run concurrently? Will it run at all?

This should not run at all.
You need several threads to work with different devices. You cannot even change the device for a thread.

BarsMonster

With CUDA 2.0 that's not true. You can now select which context is active on the current thread with cuCtxPushCurrent()/cuCtxPopCurrent() (see 4.5.3.2 in Programming Manual 2.0b2).
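A minimal sketch of that on one thread (my variable names, error checking omitted; note that cuCtxCreate() makes the new context current, so each one is popped right after creation to keep the stack clean):

#include <cuda.h>

int main(void)
{
    CUdevice  dev0, dev1;
    CUcontext ctx0, ctx1;

    cuInit(0);

    cuDeviceGet(&dev0, 0);
    cuCtxCreate(&ctx0, 0, dev0);   // becomes current on this thread
    cuCtxPopCurrent(NULL);         // detach it so we can push/pop freely

    cuDeviceGet(&dev1, 1);
    cuCtxCreate(&ctx1, 0, dev1);
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx0);        // driver calls now target GPU 0
    /* ... alloc, copy, launch on GPU 0 ... */
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx1);        // same thread, now GPU 1
    /* ... alloc, copy, launch on GPU 1 ... */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
    return 0;
}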

xmonraz

Though it is possible to manage more than one device from one thread, I wouldn't suggest doing so. Your code can easily get tangled up in all this context-stack management. It's much more straightforward to have two threads, each working with its own device, IMO.

Thanks for that recommendation. The problem is that my computational task varies depending on the range of items I tell it to work on. To get the most utilization out of my multiple GPUs, I'm thinking of making more, smaller kernel calls and invoking them on whichever GPU is free.

Can you tell me what might go wrong? I'm just trying to get an idea of what people are doing or thinking in these situations.

I can imagine the following being better than a multi-threaded solution, but what do you guys and gals think? At this point it doesn't seem to be a matter of whether it's possible, but rather which approach is recommended/better (given the specific application, I know, I know… but you have an idea of what I have to deal with):

set up a context on each device, probably an array of them

while (theresWorkToBeDone) {
    … get or wait for an available GPU
    … push its ctx
    … do some work
    … pop the ctx
}

destroy all ctx
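Fleshed out a little, that could look like the fragment below (error checking omitted; workRemains(), nextAvailableGpu() and launchWork() are placeholder names for whatever your scheduler and kernels actually do):

#include <cuda.h>

#define NUM_GPUS 2

CUcontext ctx[NUM_GPUS];

cuInit(0);
for (int i = 0; i < NUM_GPUS; ++i) {
    CUdevice dev;
    cuDeviceGet(&dev, i);
    cuCtxCreate(&ctx[i], 0, dev);  // created current on this thread...
    cuCtxPopCurrent(NULL);         // ...so detach it right away
}

while (workRemains()) {            // placeholder
    int i = nextAvailableGpu();    // placeholder scheduler
    cuCtxPushCurrent(ctx[i]);
    launchWork(i);                 // placeholder: launches, copies, ...
    cuCtxPopCurrent(NULL);
}

for (int i = 0; i < NUM_GPUS; ++i)
    cuCtxDestroy(ctx[i]);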

This should work; however, there may still be some problems. One is GPU under-utilization if you have different GPUs installed, i.e. if you have a loop like this:

GPU_CTX ctx[2];
int idx = 0;

while (1) {
    CtxPush( ctx[idx] );
    Synchronize( ctx[idx] );
    LaunchKernel( ctx[idx] );
    CtxPop( ctx[idx] );
    idx ^= 1; // switch devices
}

it will do the following:

synch( ctx[0] ); run( ctx[0] );
synch( ctx[1] ); run( ctx[1] );
synch( ctx[0] ); run( ctx[0] );
synch( ctx[1] ); run( ctx[1] );

Now, if the device for ctx[0] is faster than the device for ctx[1] (e.g. an 8800GTX and an 8600GT), you will still be limited by the performance of the 8600GT: the strict alternation gives each GPU exactly one kernel per pass, so the faster card sits idle waiting for the slower one. In fact, you'll get 2x 8600GT and not 8600GT+8800GTX.

This problem can be solved by using the Stream and/or Event API to check whether a device is busy, but I'm not really good at this ;-)
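Something like this might work (a rough sketch, not tested; done[] is an event I'd create per context): record an event after launching, then poll it with cuEventQuery(), which returns CUDA_SUCCESS once everything recorded before the event has finished and CUDA_ERROR_NOT_READY otherwise:

CUevent done[2];

// once, with ctx[i] current:
cuEventCreate(&done[i], CU_EVENT_DEFAULT);

// after launching work, still with ctx[i] current:
cuEventRecord(done[i], 0);           // 0 = default stream of ctx[i]

// later, to test whether GPU i is still busy:
cuCtxPushCurrent(ctx[i]);
CUresult r = cuEventQuery(done[i]);  // non-blocking
cuCtxPopCurrent(NULL);
if (r == CUDA_SUCCESS) {
    // GPU i is idle: safe to push ctx[i] and issue more work
}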

Hope this helps )

Great point. The whole reasoning for this was to eliminate GPU under-utilization. The idea was to cut my problem into smaller kernels that still utilize a full GPU and run them asynchronously, without synchronization. Rather than iterating through the GPUs, I'd have a smart function that can tell me whether nothing is running on some GPU of my 'pool' and return that one, or check another one, or wait.
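Something like this is what I'm picturing for that smart function (sketch only, building on the per-context events above; findIdleGpu() is my made-up name):

// returns the index of an idle GPU, or -1 if all are busy
int findIdleGpu(CUcontext *ctx, CUevent *done, int n)
{
    for (int i = 0; i < n; ++i) {
        cuCtxPushCurrent(ctx[i]);
        CUresult r = cuEventQuery(done[i]);  // non-blocking
        cuCtxPopCurrent(NULL);
        if (r == CUDA_SUCCESS)
            return i;  // everything recorded on GPU i has finished
    }
    return -1;         // all busy: caller can poll again or wait
}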

Thanks so much for the insight. I’m barely getting into this context managing thing myself and hopefully it’ll play out well =P