How can multiple video cards be used in parallel?
With the CUDA driver API this is possible; I am not sure if C/C++/CUDA C/C++ has anything special for it.
It basically comes down to using multiple devices, multiple contexts, one context per device.
Either you use one CPU thread and switch between contexts, or you could multi-thread it and have one context per thread. (The CUDA driver API is supposed to be thread-safe, I think… not sure, haven’t tested that yet.)
So suppose one thread, then:
EnterContext // probably push api
// use device, call api’s, launch kernels, etc, it will all be done on the current context.
LeaveContext // probably pop api
EnterAnotherContext
// use other device
LeaveAnotherContext
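For illustration, a minimal sketch of that pattern with the actual driver API push/pop calls, assuming two devices are present; error checking is omitted and the “use device” parts are placeholders:

#include <cuda.h>

int main(void)
{
    CUdevice dev0, dev1;
    CUcontext ctx0, ctx1;

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);

    // cuCtxCreate makes the new context current, so pop it right away
    // to start from a clean “no current context” state.
    cuCtxCreate(&ctx0, 0, dev0);
    cuCtxPopCurrent(NULL);
    cuCtxCreate(&ctx1, 0, dev1);
    cuCtxPopCurrent(NULL);

    cuCtxPushCurrent(ctx0);   // EnterContext
    // ... use device 0: allocate memory, load modules, launch kernels ...
    cuCtxPopCurrent(NULL);    // LeaveContext

    cuCtxPushCurrent(ctx1);   // EnterAnotherContext
    // ... use device 1 ...
    cuCtxPopCurrent(NULL);    // LeaveAnotherContext

    cuCtxDestroy(ctx0);
    cuCtxDestroy(ctx1);
    return 0;
}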
The difficulty is probably in dividing the work among devices; that is problem/algorithm specific.
It might be easier if kernels could launch over multiple devices.
Perhaps a deviceIdx and deviceDim could be added; perhaps this already exists, but I haven’t seen it, and it’s not in the documentation, so it probably doesn’t exist.
Devices probably also have their own memory space, but a solution might be unification/unified addressing, which is a new feature only available on Tesla cards…
So “unified addresses” are probably a thing for the future for consumer cards, if they ever get that functionality.
I think heat could be an issue, so I am not betting on a multi-device future for now :)
You need to visit the good old multi-GPU thread in which Mr. Anderson (Mr. then, Dr. now) showed a beautiful way to do multi-GPU programming using multiple threads.
Although CUDA 4.0 brings in so much new stuff, the old thread is still relevant and elegant.
http://forums.nvidia.com/index.php?showtopic=66598&st=0
Best Regards,
Sarnath
The code is not available from that thread.
Elegant is subjective here… if you want to be able to simply make API calls for any context, then yes, it would be elegant, but also overheadish… all the binds are probably pushing/popping contexts all the time…
That was my initial idea too, but it gets annoying after a while, for two reasons:
- Either the API wrappers have to do the bind-related stuff, which makes their design more complex and also more overheadish,
or
- Saddle the user with all that stuff: gpu0.call(bind(…))
^ Having to write that for each call is kinda annoying… and not the goal of my framework, which is low overhead and productivity/less typing.
Also, it seems best to me to start with the low-level API, and my recommendation for now would be: code for one device at a time.
Also, if it’s multi-threaded, then binding all the time is of course not needed, just once per thread, like in my example above (a quick sketch follows below).
Finally, if the programmer does want to make API calls for any device, he/she could also wrap the context stuff himself/herself
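A rough sketch of that bind-once-per-thread variant, assuming one device per host thread and C++11 std::thread for the threading; error checking omitted:

#include <cuda.h>
#include <thread>
#include <vector>

static void deviceWorker(int ordinal)
{
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, ordinal);
    cuCtxCreate(&ctx, 0, dev);   // bind once; the context stays current
                                 // for this thread's whole lifetime
    // ... allocate, load modules, launch kernels for this device ...
    cuCtxDestroy(ctx);
}

int main()
{
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);

    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i)
        workers.emplace_back(deviceWorker, i);   // one thread per device
    for (std::thread &w : workers)
        w.join();
    return 0;
}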
I even went as far as including critical sections in the enter-context/leave-context calls… but I ended up removing that… critical sections are not needed for single-threaded programming, and they also made the lower-level driver API wrapper too complex for my taste…
For now I don’t think such features are needed; I also like to give the programmer using the framework more control over his own critical sections… you want to do multi-threading? → get your own critical sections and secure it, lol :) Not so difficult… my device object could provide a critical section just so one doesn’t need to be created… but even the creation code is not that much…
High-level frameworks could be created which deal with all of that, though… but then everything becomes a bit more vague…
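For what it’s worth, a rough sketch of what such enter/leave wrappers with a built-in critical section might look like; the class name is made up, and std::mutex stands in for the platform’s critical section:

#include <cuda.h>
#include <mutex>

class DeviceContext   // hypothetical wrapper, names made up
{
public:
    explicit DeviceContext(CUdevice dev)
    {
        cuCtxCreate(&mCtx, 0, dev);
        cuCtxPopCurrent(NULL);   // don't leave it current on creation
    }
    ~DeviceContext() { cuCtxDestroy(mCtx); }

    void Enter()      // lock first, then push the context, so only
    {                 // one host thread is inside this context at a time
        mLock.lock();
        cuCtxPushCurrent(mCtx);
    }

    void Leave()      // pop the context, then release the lock
    {
        cuCtxPopCurrent(NULL);
        mLock.unlock();
    }

private:
    CUcontext mCtx;
    std::mutex mLock;
};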
CUDA 4.0 handles all of this for you now anyway. Contexts are thread-safe and can be used by multiple threads at the same time.
Perhaps you mean the “CUDA runtime API 4.0”.
As far as I know, the “CUDA driver API 4.0” still requires manual context switching/threading and all of that.
I have seen no documentation which would make me believe otherwise.
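For contrast, a minimal sketch of the runtime-API style the previous poster may be referring to, where a thread just selects a device and the runtime manages the context behind the scenes; the kernel and function names are placeholders:

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* placeholder kernel */ }

void runOnDevice(int ordinal, float *hostData, size_t n)
{
    cudaSetDevice(ordinal);              // no explicit context handling
    float *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    myKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
}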
Euhm… I just looked at my code again… for something else… and the critical sections are still inside the CUDA context wrappers… which is probably a good thing… this allows multi-threading and multi-device support in one go.
I did remove the context stuff from module loading and such… that’s where I thought some complexity was unnecessary.
So enter context, load modules, leave context.
It’s nice to have one entry point and one exit point for the context… that way all other APIs and frameworks can simply assume a context is present… no further switching required… (sketch below).
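A minimal sketch of that single entry/exit discipline, assuming the context was created earlier; the module file and kernel names are placeholders:

#include <cuda.h>

void loadKernels(CUcontext ctx, CUmodule *mod, CUfunction *fn)
{
    cuCtxPushCurrent(ctx);                     // the one entry point
    cuModuleLoad(mod, "kernels.cubin");        // no switching of its own:
    cuModuleGetFunction(fn, *mod, "myKernel"); // a context is simply present
    cuCtxPopCurrent(NULL);                     // the one exit point
}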