This is partially related to another forum topic of mine (https://devtalk.nvidia.com/default/topic/931791/cuda-programming-and-performance/recoving-after-a-tdr-event/).
I have a library that I am porting from CUDA 2.3 to 7.5 and I have just discovered some significant changes in the way CUDA contexts work.
Previously, my library created one thread per device; each thread created a context and called cudaThreadExit() when it finished. My understanding was that this kept my contexts private to my library, so other parts of the same process could create their own CUDA contexts without interference. However, I notice that cudaThreadExit() is now deprecated because its behaviour has changed: it is now equivalent to cudaDeviceReset(). I should therefore avoid calling cudaThreadExit(), because it would affect other parts of the same process that might be using CUDA.
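To illustrate what worries me (a hedged sketch, not my actual library code), here is the old per-thread cleanup habit and why it is now dangerous under the 7.5 semantics:

```c
#include <cuda_runtime.h>

/* Sketch: under CUDA 7.5, cudaThreadExit() is documented as equivalent
 * to cudaDeviceReset(), which destroys the primary context for the
 * current device process-wide, not just for this thread. */
void library_thread_cleanup(void)
{
    /* Old CUDA 2.3 habit: tear down "this thread's" context. */
    cudaThreadExit();   /* deprecated; now behaves like cudaDeviceReset() */

    /* Any other thread in the process that was using the same device
     * may now get errors on its existing allocations, streams, and
     * events, because they belonged to the destroyed context. */
}
```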
I also notice that there are now primary device contexts. If my library uses the primary device contexts, then it shares those contexts with other parts of the same process. This seems undesirable and, in particular, means that my library can only set the device flags if it is the first part of the process to initialise the primary device contexts.
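For example (a sketch, assuming the runtime API's documented behaviour), the flags problem looks like this: cudaSetDeviceFlags() only succeeds before the primary context for that device has been created, so a library cannot rely on it:

```c
#include <cuda_runtime.h>

/* Sketch: try to set scheduling flags on a device's primary context.
 * If another part of the process has already initialised this device,
 * the runtime returns cudaErrorSetOnActiveProcess and the flags chosen
 * by the first initialiser remain in effect. */
cudaError_t try_set_blocking_sync(int device)
{
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess)
        return err;
    return cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
}
```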
It would appear that I can achieve something closer to the old behaviour by creating non-primary contexts using the CUDA Driver API. However, the documentation seems to specifically recommend against this:
Note that the use of multiple CUcontexts per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used.
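For reference, the driver-API approach I have in mind is roughly the following (a hedged sketch; error handling omitted): each library thread owns a private, non-primary context, leaving the primary context and its flags untouched for the rest of the process:

```c
#include <cuda.h>
#include <stddef.h>

/* Sketch: create a private, non-primary context for one device.
 * The context is popped off this thread's stack so the caller can
 * push it only around the library's own work. */
CUcontext create_private_context(int ordinal, unsigned int flags)
{
    CUdevice  dev;
    CUcontext ctx = NULL;

    cuInit(0);                      /* safe to call more than once */
    cuDeviceGet(&dev, ordinal);
    cuCtxCreate(&ctx, flags, dev);  /* ctx is made current here */
    cuCtxPopCurrent(NULL);          /* detach; push again when needed */
    return ctx;
}

/* Usage: wrap library work in cuCtxPushCurrent(ctx) / cuCtxPopCurrent(),
 * and call cuCtxDestroy(ctx) at shutdown instead of cudaDeviceReset(). */
```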
To what extent is performance actually degraded? Is it more the case that a few specific operations or usage patterns will incur an overhead?