I maintain a CUDA Driver API wrapper, and I have been trying to use cuBLAS with it; however, I ran into weird issues with contexts. I am currently doing this:
- Initialize a context using the normal cuCtxCreate
- Make a stream, create a cuBLAS handle, execute a routine, synchronize, read back, you know the gist.
This all works great; the issue is actually at the end of the function. It seems that the runtime API is, let's just say, less than happy that I am destroying the context it presumably tried to hold on to for future use. This leads to spurious segfaults: sometimes it works, sometimes it doesn't. Which is strange, because the runtime API should create a new context if the one it was using died.
So I did a little bit of digging, and I found that the driver API does in fact have an interface for primary context management: the same mechanism the runtime API uses, except there is no context stack and there is exactly one primary context per device.
So I have a couple of questions:
- Should I just deprecate normal driver API contexts in my wrapper and switch to primary contexts, i.e. delete the context stack too?
- If so, what do I need to do for multithreading? Can I just retain the primary context on every thread, since it is reference counted? Do I still need to call cuCtxSetCurrent on every thread? (I would assume so; otherwise it will use device 0.)
- Will calling retain implicitly initialize anything needed, or do I need to call cudaFree(0) or something to make sure the runtime API is initialized? If so, do I also need to link against the runtime API, or can I do it all from driver functions?
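To make the retain/release question concrete, here is a toy model of what I understand the reference counting to do. This is pure Rust with no real CUDA calls; `retain` and `release` are stand-ins for cuDevicePrimaryCtxRetain and cuDevicePrimaryCtxRelease, and the struct name is made up:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Toy model of a device's primary context: alive while refcount > 0.
struct PrimaryCtx {
    refcount: Mutex<u32>,
}

impl PrimaryCtx {
    fn new() -> Self {
        PrimaryCtx { refcount: Mutex::new(0) }
    }
    // Stand-in for cuDevicePrimaryCtxRetain: bumps the refcount
    // ((re)initializing the context on the first retain).
    fn retain(&self) {
        *self.refcount.lock().unwrap() += 1;
    }
    // Stand-in for cuDevicePrimaryCtxRelease: the context is only
    // torn down when the count drops back to zero.
    fn release(&self) {
        *self.refcount.lock().unwrap() -= 1;
    }
    fn is_alive(&self) -> bool {
        *self.refcount.lock().unwrap() > 0
    }
}

fn main() {
    let ctx = Arc::new(PrimaryCtx::new());
    let mut handles = Vec::new();
    for _ in 0..4 {
        let ctx = Arc::clone(&ctx);
        handles.push(thread::spawn(move || {
            ctx.retain(); // each thread retains...
            // ...would call cuCtxSetCurrent and do its work here...
            ctx.release(); // ...and releases when done
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // Once every thread has released, the context is gone.
    assert!(!ctx.is_alive());
    println!("refcount back to zero: context torn down");
}
```

If every thread pairs its retain with a release, no thread can tear the context down while another is still using it.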
Every context you create on a device uses up space.
If it were me, I would not want to ever have more than one context per device.
If you have a library that expects to interact with an application that is using the driver API, then I would either require a context to be explicitly handed to you and use only that, or else maintain some notion of a context stack in a proper state.
If you have a library that expects to interact with an application that is using the runtime API, then I would say, simply, do nothing. Don’t create your own contexts. Expect any device state (e.g. pointers/allocations/resources handed to your library) to be valid in the primary context, and you have nothing explicitly to manage.
Multi-threading should be orthogonal to this. CUDA contexts have been shareable amongst threads for a long time now.
You should never destroy the primary context on a device if you are using or might be using the runtime API, unless you simply intend to exit right then and there or are fine with weird errors (or otherwise have a solid plan). Just because the runtime API might create a new context doesn’t mean that any previously established state will be automatically reestablished in the “new” primary context. This can lead to all sorts of hilarity for an unsuspecting library user.
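The failure mode can be sketched with a toy model (pure Rust, no CUDA; the types and the `destroy_primary_ctx` name are made up for illustration). Allocations are tagged with the context "generation" they were made in; destroying the primary context bumps the generation, so a lazily recreated context does not make old state valid again:

```rust
// Toy model: allocations are only valid in the context "generation"
// they were made in.
struct Device {
    generation: u64,
}

#[derive(Clone, Copy)]
struct Allocation {
    generation: u64,
}

impl Device {
    fn new() -> Self {
        Device { generation: 0 }
    }
    fn alloc(&self) -> Allocation {
        Allocation { generation: self.generation }
    }
    // Stand-in for destroying the primary context. The runtime will
    // lazily create a *new* one on next use, but previously
    // established state is not restored into it.
    fn destroy_primary_ctx(&mut self) {
        self.generation += 1;
    }
    fn is_valid(&self, a: Allocation) -> bool {
        a.generation == self.generation
    }
}

fn main() {
    let mut dev = Device::new();
    let buf = dev.alloc();
    assert!(dev.is_valid(buf));
    dev.destroy_primary_ctx(); // an unsuspecting library's "cleanup"
    // A fresh primary context exists implicitly, but the allocation
    // made in the old one is now stale: hilarity ensues.
    assert!(!dev.is_valid(buf));
    println!("old allocation is stale in the new primary context");
}
```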
I don’t really know the objectives of your wrapper (other than, I guess, to make CUDA available to Rust applications), but it’s not evident to me why you would start down that road with the driver API, unless there was a pre-existing notion that the thing you wanted to support was going to be creating driver API contexts and whatnot. I don’t see how you get there if Rust is your starting point. But I know very little about Rust.
The issue is more that users add GPU acceleration to their own library and create a context, but then another user who consumes that library will also make their own context, which causes tons of issues… It is far easier to use primary contexts, which make any context creation okay.
The issue with multithreading is also not soundness; it’s that my contexts use RAII to drop the context as soon as it goes out of scope. This works perfectly for almost everything in CUDA except contexts, because they can be implicitly used by other things, and pulling the rug out from under them makes stuff go kaboom.
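Concretely, the hazard is a Drop impl that destroys the context outright. If the guard instead releases a shared refcount on drop (a sketch with made-up names; the real calls would be cuDevicePrimaryCtxRetain/Release, not cuCtxDestroy), scope exit becomes safe:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

// Shared primary-context state; torn down only at refcount zero.
struct CtxInner {
    refcount: AtomicU32,
}

// RAII guard: construction retains, Drop releases instead of
// destroying, so one guard going out of scope can't pull the rug
// out from under everyone else still using the context.
struct CtxGuard {
    inner: Arc<CtxInner>,
}

impl CtxGuard {
    fn retain(inner: &Arc<CtxInner>) -> CtxGuard {
        inner.refcount.fetch_add(1, Ordering::SeqCst);
        CtxGuard { inner: Arc::clone(inner) }
    }
}

impl Drop for CtxGuard {
    fn drop(&mut self) {
        // Stand-in for cuDevicePrimaryCtxRelease, *not* cuCtxDestroy.
        self.inner.refcount.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let inner = Arc::new(CtxInner { refcount: AtomicU32::new(0) });
    let outer = CtxGuard::retain(&inner);
    {
        let _scoped = CtxGuard::retain(&inner);
        assert_eq!(inner.refcount.load(Ordering::SeqCst), 2);
    } // _scoped dropped here: releases, does not destroy
    assert_eq!(inner.refcount.load(Ordering::SeqCst), 1);
    drop(outer);
    assert_eq!(inner.refcount.load(Ordering::SeqCst), 0);
    println!("context torn down only after the last guard dropped");
}
```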