Multiple CUDA contexts per device in a single process

This is partially related to another forum topic of mine (https://devtalk.nvidia.com/default/topic/931791/cuda-programming-and-performance/recoving-after-a-tdr-event/).

I have a library that I am porting from CUDA 2.3 to CUDA 7.5, and I have just discovered some significant changes in how CUDA contexts work.

Previously my library would create one thread per device; each thread would create a context and call cudaThreadExit() when it finished. My understanding was that this kept my contexts private to my library, so other parts of the same process could create their own CUDA contexts without any interference. However, I notice that cudaThreadExit() is now deprecated because its behaviour has changed: it is now equivalent to cudaDeviceReset(). Thus I should avoid calling cudaThreadExit(), because it would affect other parts of the same process that might be using CUDA.
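For concreteness, here is a minimal sketch of that old pattern (hypothetical code, not my actual library; the deviceWorker name and structure are just for illustration):

```cpp
// Hypothetical sketch of the old per-thread pattern, not my actual library code.
// Under CUDA 2.3, each host thread had its own context, so cudaThreadExit()
// tore down only the calling thread's context. Under CUDA 7.5 it behaves like
// cudaDeviceReset() and resets the device for the whole process.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void deviceWorker(int device)
{
    cudaSetDevice(device);   // bind this thread to the device
    // ... launch kernels, copy memory, etc. ...
    cudaThreadExit();        // deprecated: now equivalent to cudaDeviceReset()
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.emplace_back(deviceWorker, d);
    for (auto& t : workers)
        t.join();
    return 0;
}
```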

I also notice that there are now primary device contexts. If my library uses the primary device contexts, this implies that I now share contexts with other parts of the same process. That seems undesirable, and in particular it means my library can only set the context creation flags if it is the first part of the process to initialise the primary device contexts.
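To illustrate the flag constraint (a sketch I put together, assuming the CUDA 7.5 runtime behaviour): cudaSetDeviceFlags() fails with cudaErrorSetOnActiveProcess once the primary context has already been initialised by other code in the process:

```cpp
// Sketch of the primary-context flag constraint (assumed CUDA 7.5 behaviour).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Only succeeds if the primary context for the current device has not yet
    // been initialised by any other code in this process.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err == cudaErrorSetOnActiveProcess)
        std::printf("primary context already active; its creator's flags win\n");

    cudaFree(0);  // forces primary context initialisation with whatever flags apply
    return 0;
}
```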

It would appear that I can get something closer to the old behaviour by creating non-primary contexts with the CUDA Driver API (sketched below). However, the documentation seems to specifically recommend against this:

Note that the use of multiple CUcontexts per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used.
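For reference, the Driver API pattern I have in mind looks roughly like this (a sketch only; error checking omitted):

```cpp
// Sketch of a per-device non-primary context via the Driver API.
// cuCtxCreate() creates a context distinct from the device's primary context,
// so tearing it down does not disturb runtime-API users elsewhere in the process.
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // non-primary context, private to this code

    // ... make ctx current on worker threads with cuCtxSetCurrent(ctx),
    //     do the library's work, then clean up ...

    cuCtxDestroy(ctx);
    return 0;
}
```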

To what extent is performance actually degraded? Is it more the case that a few specific operations or usage patterns will incur an overhead?

I can't quantify the performance degradation for you, but the reason it is advised against is that CUDA activity in separate contexts cannot run on the device simultaneously. The device must context-switch between activity from each context, and that context switching incurs overhead that is not incurred if all threads of a process share the same context.

The multiple-contexts-per-process scenario basically puts you in the same performance boat as running multiple processes on a single GPU (and without any of the benefit of CUDA MPS, the Multi-Process Service).
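If you want to observe the serialisation yourself, something along these lines should show it (a rough sketch, relying on runtime/driver interop so the runtime launch uses whichever driver context is current on the thread; the spin kernel is just a placeholder workload):

```cpp
// Sketch: two non-primary contexts on one device, each running the same kernel
// from its own thread. Activity from separate contexts cannot overlap on the
// device, so the two launches serialise.
#include <cuda.h>
#include <cuda_runtime.h>
#include <thread>
#include <chrono>
#include <cstdio>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

void runInContext(CUcontext ctx)
{
    cuCtxSetCurrent(ctx);        // runtime calls below use this driver context
    spin<<<1, 1>>>(1LL << 30);
    cudaDeviceSynchronize();
}

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext a, b;
    cuCtxCreate(&a, 0, dev);
    cuCtxCreate(&b, 0, dev);

    auto t0 = std::chrono::steady_clock::now();
    std::thread ta(runInContext, a), tb(runInContext, b);
    ta.join(); tb.join();
    auto t1 = std::chrono::steady_clock::now();

    // Expect roughly the sum of the two kernel times, not the max, because
    // the device must context-switch between the two contexts.
    std::printf("elapsed: %f s\n",
                std::chrono::duration<double>(t1 - t0).count());

    cuCtxDestroy(a);
    cuCtxDestroy(b);
    return 0;
}
```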

Thanks, that makes the situation much clearer.