Multiple CUDA contexts per device in a single process

This is partially related to another forum topic of mine (https://devtalk.nvidia.com/default/topic/931791/cuda-programming-and-performance/recoving-after-a-tdr-event/).

I have a library that I am porting from CUDA 2.3 to CUDA 7.5, and I have just discovered some significant changes in how CUDA contexts work.

Previously my library would create one thread per device; each thread would create a context and call cudaThreadExit() when it finished. My understanding was that this kept my contexts private to my library, so other parts of the same process could create their own CUDA contexts without any interference. However, I notice that cudaThreadExit() is now deprecated because its behaviour has changed: it is now equivalent to cudaDeviceReset(). Thus I should avoid calling cudaThreadExit(), because it would affect other parts of the same process that might be using CUDA.
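For concreteness, here is a minimal sketch of that old pattern (hypothetical code, not my actual library; the deviceWorker name and structure are just for illustration):

```cpp
// Hypothetical sketch of the old per-thread pattern, not my actual library code.
// Under CUDA 2.3, each host thread had its own context, so cudaThreadExit()
// tore down only the calling thread's context. Under CUDA 7.5 it behaves like
// cudaDeviceReset() and resets the device for the whole process.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void deviceWorker(int device)
{
    cudaSetDevice(device);   // bind this thread to the device
    // ... launch kernels, copy memory, etc. ...
    cudaThreadExit();        // deprecated: now equivalent to cudaDeviceReset()
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.emplace_back(deviceWorker, d);
    for (auto& t : workers)
        t.join();
    return 0;
}
```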

I also notice that there are now primary device contexts. If my library uses the primary device contexts, this implies that I now share contexts with other parts of the same process. That seems undesirable, and in particular it means my library can only set the context creation flags if it is the first part of the process to initialise the primary device contexts.
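To illustrate the flag constraint (a sketch I put together, assuming the CUDA 7.5 runtime behaviour): cudaSetDeviceFlags() fails with cudaErrorSetOnActiveProcess once the primary context has already been initialised by other code in the process:

```cpp
// Sketch of the primary-context flag constraint (assumed CUDA 7.5 behaviour).
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Only succeeds if the primary context for the current device has not yet
    // been initialised by any other code in this process.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err == cudaErrorSetOnActiveProcess)
        std::printf("primary context already active; its creator's flags win\n");

    cudaFree(0);  // forces primary context initialisation with whatever flags apply
    return 0;
}
```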

It would appear that I can get something closer to the old behaviour by creating non-primary contexts with the CUDA Driver API (sketched below). However, the documentation seems to specifically recommend against this:

Note that the use of multiple CUcontexts per device within a single process will substantially degrade performance and is strongly discouraged. Instead, it is highly recommended that the implicit one-to-one device-to-context mapping for the process provided by the CUDA Runtime API be used.
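For reference, the Driver API pattern I have in mind looks roughly like this (a sketch only; error checking omitted):

```cpp
// Sketch of a per-device non-primary context via the Driver API.
// cuCtxCreate() creates a context distinct from the device's primary context,
// so tearing it down does not disturb runtime-API users elsewhere in the process.
#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // non-primary context, private to this code

    // ... make ctx current on worker threads with cuCtxSetCurrent(ctx),
    //     do the library's work, then clean up ...

    cuCtxDestroy(ctx);
    return 0;
}
```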

To what extent is performance actually degraded? Is it more the case that a few specific operations or usage patterns will incur an overhead?

I can't quantify the performance degradation for you, but the reason it is advised against is that CUDA activity in separate contexts cannot run on the device simultaneously. The device must context-switch between activity from each context, and that context switching incurs overhead that is not incurred if all threads of a process share the same context.

The multiple-contexts-per-process scenario basically puts you in the same performance boat as running multiple processes on a single GPU (and without any of the benefit of CUDA MPS, the Multi-Process Service).
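If you want to observe the serialisation yourself, something along these lines should show it (a rough sketch, relying on runtime/driver interop so the runtime launch uses whichever driver context is current on the thread; the spin kernel is just a placeholder workload):

```cpp
// Sketch: two non-primary contexts on one device, each running the same kernel
// from its own thread. Activity from separate contexts cannot overlap on the
// device, so the two launches serialise.
#include <cuda.h>
#include <cuda_runtime.h>
#include <thread>
#include <chrono>
#include <cstdio>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

void runInContext(CUcontext ctx)
{
    cuCtxSetCurrent(ctx);        // runtime calls below use this driver context
    spin<<<1, 1>>>(1LL << 30);
    cudaDeviceSynchronize();
}

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext a, b;
    cuCtxCreate(&a, 0, dev);
    cuCtxCreate(&b, 0, dev);

    auto t0 = std::chrono::steady_clock::now();
    std::thread ta(runInContext, a), tb(runInContext, b);
    ta.join(); tb.join();
    auto t1 = std::chrono::steady_clock::now();

    // Expect roughly the sum of the two kernel times, not the max, because
    // the device must context-switch between the two contexts.
    std::printf("elapsed: %f s\n",
                std::chrono::duration<double>(t1 - t0).count());

    cuCtxDestroy(a);
    cuCtxDestroy(b);
    return 0;
}
```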

Thanks, that makes the situation much clearer.