GPU sharing among different application with different CUDA context


Let’s say I have single Tesla K80 card and I have two different applications to be run on GPU. The two applications are designed to process a large chunk of data and expected to run for hours.

Now the question is if they are create and launch the CUDA kernel with different context at same time then will they be executed in serialized way i.e. till the first application finishes other will wait? Or the CUDA driver can run concurrent kernels from different applications may be using time based context switching or in spatial multiplexing mode such that both the kernel running at same time in parallel on same GPU on mutually exclusive memory and SM resources.

Please provide some insight.


1 Like

Both applications can run at the same time, however the kernels will be serialized. This assumes that the 2 applications memory and resource usage combined will fit on the same GPU.

Unless you use MPS, CUDA will not run kernels concurrently from different applications.

Even within a single process/application, however, kernel concurrency is rare/hard to witness. So this “serialization” might not make much difference.

A K80 is really 2 GPU devices in one, so it might be simpler/easier just to launch one application on one K80 device and the other app on the other K80 device. You can use CUDA_VISIBLE_DEVICES environment variable to help steer this.

Thanks for the answer.

But is it not contradictory that they will run at same time but serialized? I mean if they are serialized then how would they run at same time?
Is there some gap in my understanding?

Also as per my study it seems even with MPS, it allows for different kernels from the same applications to run concurrently if the kernels are having the same CUDA context. But if they belong to different applications then the context can not be same and the tasks are effectively serialized. Isn’t it?


I’m not able to assess your knowledge level. If we start from first principles this could be a fairly lengthy topic. What I said was the applications can run at the same time, but the kernels will serialize.

A CUDA application consists of host code and device code. The device code roughly speaking can be referred to as a set of kernel launches which run asynchronously from the host code. Therefore lets imagine we had an application that ran for about 1 second, then launched a kernel for 1 second, then did some further host processing for 1 second, then launched a kernel for 1 second, then did 1 second of host processing, then exits.

Please forgive the ascii art. The application timeline might look like this:

AAAAAAAAAAAAAAAAAAAA       (application)
    KKKK    KKKK                 (kernel)

Now suppose we launched 2 of these applications, in two different processes (e.g. two different command prompts/terminals), on a machine with a single GPU. Let’s also suppose that MPS is not in view here. In CUDA, by definition each application/process that uses a GPU will create its own context. A context can be thought of similarly to how you may think of a process space in the CPU/host system. Each host process will have a separate device context associated with it.

Coming back to our timeline, the processes might look like this:

P1: AAAAAAAAAAAAAAAAAAAA         <------------------------------------------------|
        KKKK    KKKK                   <--|          the applications run at the "same time"    
                                      the kernels serialize                       |
P2: AAAAAAAAAAAAAAAAAAAAAAAA              |        <------------------------------|
            KKKK    KKKK               <--|

Those two applications were started at the same time. My claim is from the user perspective (and from the standpoint of monitoring utilities such as top) they appear to be running at the same time. However, under the hood, the kernel launches do not run at the same time, they serialize. And this serialization will have some effect on the overall timeline (for example, in the case above, one of the 2 applications runs for 1 second longer than the other.)

So what does MPS do? In a nutshell, MPS acts as an intermediary/proxy between user applications and the GPU resources. MPS “funnels” all user application activity effectively into a single GPU context. This may not have much if any effect on the overall double-timeline above, because many, probably most CUDA kernels are written in such a way that they fully occupy the GPU anyway. If a kernel fully occupies a GPU, there really isn’t any opportunity for kernel concurrency even with MPS. MPS only makes an obvious difference when the resource utilization of the kernels in question is so small that concurrency is possible. Then, in that scenario, MPS will enable kernel concurrency that would not normally have been possible with independent user processes.

1 Like

Hi, Thanks a lot for explaining in such a detail. This is really helpful.

May be I was not very clear in drafting my question but I think you still have explained the concept and my doubt clearly.

What I understand now is that if the two applications with 1) kernel K1, CUDA context C1 and 2) kernel K2 with CUDA context C2 are running and even a single run of any of the kernel K1 or K2 takes say 10 minutes to complete the processing on GPU. Then say if kernel K1 is running then the other kernel K2 cannot be launched in parallel to say by partitioning the GPU resources in exclusive manner. Also it is still run to completion mode that is in that 10 minute of of run of K1, it cannot be preempted for time multiplexing for example 1 second of K1 then context switching and scheduling of K2 and so on just like CPU based general multitasking operating systems.

In summary i wanted to know that it is no pre-emptive scheduling and no parallel processing of different kernels at same time by partitioning GPU resources. Rather it is cooperating scheduling and I have to design my application in such a way that any kernel runs in some defined time and finishes up so the other waiting kernels can be scheduled by CUDA driver.

Do you think the approaches for concurrency are yet to come?



Please may I get some more information on last query asked?

Thanks and Regards

no, you can’t manually partition GPU resources.

its possible to use CUDA stream priority to provide a limited level of high-priority kernel processing, effectively “pre-empting” a lower priority kernel. This occurs only at the threadblock scheduling level AFAIK:

Newer GPUs like Pascal and Volta have some (so-called “thread”) pre-emption capability, but this is mostly not exposed yet for direct use in CUDA, but rather used by other CUDA features in some cases, such as cooperative groups, cuda dynamic parallelism, and debugging.

Regarding what is yet to come, I can’t predict the future for you.


Thank you very much and really appreciated the prompt reply to understand the scheduling aspects.

This is really a very good information and helped to understand the GPU concurrency a lot.


This topic very closely tracks a situation that I am trying to understand fully, so I’ll jump in here with some additional questions.
In my use case, I have multiple CPU threads accessing the GPU ‘simultaneously’. Each CPU thread has its own context, kernels, etc…
I would like each of the applications to get on/off as quickly as possible so that everyone ‘plays fair’. What I have not been able ascertain from the documentation is when the CUDA Driver API actually allows context switches.
Is it only after a kernel completes, or can context switches occur after a cuModuleLoad(), cuMemcpyHtoD/DtoH, or other such calls?

As a best practice, should I call cuCtxSetCurrent(NULL) to release my context when I’m not active - or is this even necessary?


the driver API can do a context switch any time a kernel is not running

the best practice for a multi-threaded application is not to use multiple contexts. It should be unnecessary.

I’m surprised by your best practices comment. If each CPU thread has its own modules and memory map, and they need to be persistent across multiple kernel launches, how can I use a single context?

I’m not familiar with a situation where a CPU thread has its own memory map.

No Windows or Linux process that I am familiar with works that way.

If you’re referring to a GPU memory map, the same GPU memory map can easily be shared across all threads. This is the default/usual behavior of the CUDA runtime API, and the driver API easily supports this modality with context push/pop or other methods.

Operating systems are normally designed such that a process owns resources (memory in particular), and all threads belonging to that process share those resources. Exceptions may be made for things like thread local storage (TLS).

By trivial extension, a GPU context is a resource owned by a process and shared by the threads within that process. Which jibes with txbob’s comment in #10.

Let me clarify (this is always a problem when talking about GPU/CPU)… I’m talking about multiple CPU PIDs, which each own a context on the GPU. Let’s call it a GPU/CPU context/PID pair. Each CPU PID is independent of other CPU PIDs, and each is running its own application. During initialization, the various CPU PIDs progress through cuCtxCreate(), cuMemAlloc(), cuModuleLoad(), cuModuleGetFunction(), and ultimately a bunch of cuMemcpyxxx()/cuLaunchkernel() calls. So each context/PID pair ‘owns’ a chunk of GPU memory and the kernels it wants to use to do whatever it is trying to accomplish.
My main questions concern how to ‘play fair’. I don’t want any of the CPU PIDs to hog the GPU when they don’t need to, so I want to understand the mechanism for releasing the GPU to the next context/PID pair. Is this done implicitly in that the Driver API performs a context switch whenever a kernel completes and there is another waiting in line, or should the CPU explicitly call cuCtxSetCurrent(NULL) or cuCtxPopCurrent() in order to release its hold on the GPU? What are the best practices in doing all this?

Thanks for your help!

The GPU will perform a context switch away from a GPU context that has no pending work (and no currently executing kernels) to another GPU context that has pending work automatically. You don’t need to do anything to enable this (except in the multi-process scenario be sure the GPU is in default compute mode - which is done via nvidia-smi, not anything you do in your program).

That’s just what I needed to know - thanks!
On a similar note, if I do a cuCtxSetCurrent(myContexID) followed by a cuMemCpyxxx(), will this wait for a currently running kernel attached to a different PID to complete, or will it happen asynchronously (since there is no conflict).

Hi, Sorry for asking a question…

is it possible to say that one high-priority stream in context A has higher priority than other low-priority stream in context B(different context)??
The steam priority is determined by user’s cudaStreamCreateWithPriority() call.

Thanks for your help!

stream priorities are within a context.

most defined or created entities I can think of in CUDA are unique to a particular context (including memory allocations, streams, kernels, and pointers i.e. addresses). Events may be an exception depending on how you look at it.

A GPU never runs work (kernels) from 2 or more contexts simultaneously. Therefore there is no sense of inter-context priority, and priority work in a given context does not indicate any priority condition between contexts. A GPU will context-switch away from a context, to another context, only when no kernels are running in that context. A GPU will not context switch to another context (except in preemption cases, which are not in view here) when one or more kernels are running.

The exact heuristic by which a GPU may choose to context switch is not published, AFAIK. It is safe to assume that if there are n contexts resident on a GPU, and m (less than n) of them have pending kernels, and no kernels are currently running on that GPU, that the GPU will context-switch to one of the m contexts which has available work, if that context is not already current. Apart from that, I am doubtful that there are any published specifications, or conclusions that may be drawn, about how and when a GPU may choose to context switch.

For a single application/process, even if that application is multi-threaded, a recommended practice is to use only a single context on each GPU.

This topic is very interesting to me.
But I’m very confused about concurrency after Pascal architecture which has compute-preemption (instruction-level preemption).

As far as I understand this topic, kernels are executed in serialized way even if same CUDA context.
And also CUDA stream is just overlapped (pipelined) data transfer (like cudaMemcpy) and kernel (computation).

And, in this topic
If different processes (different CUDA context) are using MPS and they have relatively low resource utilization, they can be co-resident and executed concurrently (spatial and not time-sharing).

Then, it could be said that kernels could be executed concurrently if and only if in same context(or same process with a context).

For example,
without CUDA stream

KKKK (kernel1 (compute))
kkkkkk (kernel2 (compute))

with MPS

KKKK KKKK (kernel (compute))

KKKK KKKK (kernel (compute))

I wonder my understand is correct with considering compute-preemption (instruction-level preemption).

Thanks and Regards

This is only true if the kernels are issued into separate non-default streams. That means that this:

could never happen.