From looking at the latest and greatest discussions regarding multi-GPU and multi-threading, I'm trying to figure out whether it's possible to create multiple contexts running in parallel from the same thread. I have a serialized version which works perfectly, but before separating it further into an additional two threads and dealing with syncing them, I wanted to get an idea of whether this is feasible.
Something like this:
…
cuDeviceGet(&dev,0);
cuCtxCreate(&ctx1, 0, dev);
… do cuda stuff on gpu0
cuDeviceGet(&dev,1);
cuCtxCreate(&ctx2, 0, dev);
… do cuda stuff on gpu1
cuCtxDestroy(ctx1);
cuCtxDestroy(ctx2);
Of course I might have omitted some ctx steps.
Will something like this run concurrently? Will it run at all?
With CUDA 2.0 that's no longer true. You can now select which context is current to the calling thread with cuCtxPushCurrent()/cuCtxPopCurrent() (see section 4.5.3.2 in the Programming Manual 2.0b2).
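To make that concrete, here's a minimal sketch of the push/pop pattern from a single thread (error checking omitted; the "… do cuda stuff" lines stand in for your launches):

#include <cuda.h>

int main(void)
{
    CUdevice dev0, dev1;
    CUcontext ctx0, ctx1;

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);

    /* cuCtxCreate makes the new context current to this thread,
       so pop each one right after creating it. */
    cuCtxCreate(&ctx0, 0, dev0);
    cuCtxPopCurrent(NULL);
    cuCtxCreate(&ctx1, 0, dev1);
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx0);
    /* … do cuda stuff on gpu0 … */
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx1);
    /* … do cuda stuff on gpu1 … */
    cuCtxPopCurrent(NULL);

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
    return 0;
}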
Though it is possible to manage more than one device from one thread, I wouldn't suggest doing so. Your code can easily get tangled up in all this context stack management. It's much more straightforward to have two threads, each working with its own device, IMO.
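For what it's worth, a rough sketch of that one-thread-per-device layout, here using pthreads (just an assumption on my part; any threading API would do), error checking omitted:

#include <cuda.h>
#include <pthread.h>

/* One worker per GPU: each thread creates and keeps its own context,
   so there is no push/pop juggling at all. */
static void *worker(void *arg)
{
    int ordinal = (int)(size_t)arg;
    CUdevice dev;
    CUcontext ctx;

    cuDeviceGet(&dev, ordinal);
    cuCtxCreate(&ctx, 0, dev);   /* stays current to this thread */
    /* … do cuda stuff on this device … */
    cuCtxDestroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    cuInit(0);                   /* once per process, before the threads */
    pthread_create(&t0, NULL, worker, (void *)(size_t)0);
    pthread_create(&t1, NULL, worker, (void *)(size_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}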
Thanks for that recommendation. The problem is that my computational task varies depending on the range of items I tell it to work on. To get the most utilization out of my multiple GPUs, I'm thinking of making more, smaller kernel calls and launching each one on whichever GPU is free.
Can you tell me what might go wrong? I'm just trying to get an idea of what people are doing or thinking in these situations.
I can imagine the following being better than a multi-threaded solution, but what do you guys and gals think? At this point it doesn't seem to be a question of whether it's possible, but rather which approach is recommended/better (it depends on the specific application, I know, I know… but you have an idea of what I have to deal with).
…
set up a context on each device, probably an array of them
This should work; however, there may still be some problems. One is GPU under-utilization if you have different GPUs installed, i.e. if you have a loop like this:
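Roughly along these lines; the ctx[] array and the blocking synchronize on each pass are assumptions:

for (i = 0; i < numTasks; i += numGpus) {
    /* dispatch one task to each GPU; kernel launches are async */
    for (g = 0; g < numGpus; g++) {
        cuCtxPushCurrent(ctx[g]);
        /* … launch the kernel for task i+g … */
        cuCtxPopCurrent(NULL);
    }
    /* then wait for all of them before the next round */
    for (g = 0; g < numGpus; g++) {
        cuCtxPushCurrent(ctx[g]);
        cuCtxSynchronize();   /* each round waits for the slowest GPU */
        cuCtxPopCurrent(NULL);
    }
}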
Now, if the device for ctx[0] is faster than the device for ctx[1] (e.g. an 8800 GTX and an 8600 GT), you'll still be limited by the performance of the 8600 GT. In fact, you'll get 2x 8600 GT and not 8600 GT + 8800 GTX.
This problem can be solved by using the Stream and/or Event API to check whether a device is busy, but I'm not really good at this ;-)
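Something along these lines might work (a rough sketch, untested: create a CUevent with cuEventCreate() in each context, record it with cuEventRecord() right after the last launch there, then poll it; gpu_is_free is a made-up helper name):

/* Returns non-zero once all work recorded before the event is done. */
int gpu_is_free(CUcontext ctx, CUevent done)
{
    CUresult r;
    cuCtxPushCurrent(ctx);       /* the event belongs to this context */
    r = cuEventQuery(done);      /* CUDA_SUCCESS when finished,
                                    CUDA_ERROR_NOT_READY while busy */
    cuCtxPopCurrent(NULL);
    return r == CUDA_SUCCESS;
}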
Great point. The whole reasoning for this was to eliminate GPU under-utilization. Cutting my problem into smaller kernels that still utilize the full GPU and basically running them asynchronously without synchronization is what I had in mind. Not so much iterating through the number of GPUs; rather, having a smart function that could tell me if nothing is running on a GPU in my 'pool' and return that one, or check another one, or wait.
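I'm picturing something like this (pick_free_gpu is a made-up name, building on the gpu_is_free sketch above):

/* Scan the pool for an idle GPU; -1 means everything is still busy,
   so the caller can sleep briefly and retry. */
int pick_free_gpu(CUcontext ctx[], CUevent done[], int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (gpu_is_free(ctx[i], done[i]))
            return i;
    return -1;
}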
Thanks so much for the insight. I’m barely getting into this context managing thing myself and hopefully it’ll play out well =P