Utilization of SMs in a GPU

I am a CUDA newbie and I am looking into running multiple applications together on a GPU.
From my reading, what I understand is that NVIDIA does not advise running multiple applications
on the GPU (although the CUDA driver manages to run them), since:
a) Applications do not have knowledge of each other’s context.
b) Architectures up to G280 do not support concurrent kernel/(application) execution.

Taking this into consideration, what happens if a few streaming multiprocessors are not being
used while an application is already running on the GPU? Can we detect this condition and run other applications on those idle SMs?
(Although I did not see this feature in the CUDA programming model.)

Actually, running multiple applications has worked for a long time on every CUDA compute capability. Note that all CUDA GPUs, including Fermi, interleave execution of kernels from different contexts. At no time do kernels from different contexts (which includes different applications) run on the device simultaneously. When you run a GUI on your GPU, that also is treated as a CUDA context of sorts, which is why a long-running kernel can make your display appear to freeze. Every context gets total control of all SMs while that context is active.
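
If you want to see this for yourself, something like the following (a rough sketch, nothing more) launches a single kernel that busy-waits for a while; while it is resident, no kernel from any other context can run. The iteration count is arbitrary and you will have to tune it for your GPU.

```
// spin.cu -- a minimal sketch of a long-running kernel. While it is
// resident, kernels from every other context (other applications, or
// the GUI) have to wait for it to finish.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(long long iterations)
{
    // 'volatile' keeps the compiler from optimizing the busy-wait away.
    volatile long long i = 0;
    while (i < iterations)
        i = i + 1;
}

int main()
{
    // Adjust the count so the kernel runs for a second or two on your GPU.
    spin<<<1, 1>>>(1000000000LL);
    cudaDeviceSynchronize();
    printf("spin kernel finished\n");
    return 0;
}
```

If you run two copies of this at once, you should see the second one’s kernel wait behind the first rather than overlap with it. And on a GPU that is also driving a display, the watchdog timer may kill a kernel that holds the device for too long, which is another side of the same serialization.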

The reasons NVIDIA discourages multiple applications from using the same GPU include:

  • In the past, buggy drivers could cause crashes during frequent GPU context switching. As far as I know, this has been resolved.

  • In a multi-user system (a common use case for multi-application execution), there is no graceful degradation as users exhaust the GPU memory. If one user takes all of the device memory, the second user will simply see their application abort until the first user’s application exits or frees the memory (see the sketch after this list for one way to at least detect the situation).

  • The overhead of context switching means that multiple applications will see lower total performance than a single application, and much lower if each application is executing very short kernels. This is one place where Fermi is supposed to have improved, but the overhead is still not zero.
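
Regarding the memory point above, a program can at least report the situation more helpfully. Here is a minimal sketch (the printed units are my choice) that queries how much device memory is currently free with cudaMemGetInfo:

```
// A minimal sketch: query free device memory so a multi-user program can
// report a useful message (or shrink its working set) instead of failing
// blindly when another context has taken most of the card.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("Device memory: %zu MB free of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```

Note that this is only advisory: another context can grab the memory between the query and your allocation, so you still have to check cudaMalloc’s return code.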

Note that concurrent kernel execution (as NVIDIA has defined it) is not related to concurrent application execution. Concurrent kernel execution on Fermi allows kernels from different CUDA streams in the same CUDA context to execute simultaneously, with each kernel getting a varying share of the SMs depending on the block scheduler. Different CUDA contexts running on the same device still undergo a full context switch between kernel executions, just as before.
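
To make the distinction concrete, here is a minimal sketch (the kernel and sizes are invented for illustration) of concurrent kernel execution within a single context: two independent kernels launched into different streams of the same context, which Fermi-class or newer hardware may overlap, resources permitting.

```
// A minimal sketch of concurrent kernel execution within ONE context:
// independent kernels in different non-default streams may overlap on
// compute capability 2.0+ hardware, each getting a share of the SMs.
#include <cuda_runtime.h>

__global__ void busywork(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.000001f + 0.5f;
}

int main()
{
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same context, different streams: these two launches are allowed
    // to run at the same time.
    busywork<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busywork<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Kernels from two different contexts never get this treatment; the driver switches the whole device from one context to the other between launches.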

Thanks seibert, this is quite helpful :). Is there some link or documentation where I can read more about GPU context switching, and also about the second point you mentioned (the application aborting because it cannot find enough GPU memory)?

I’m not sure about a reference for the context switching discussion. The CUDA Programming Guide is the standard reference for most things. Another option is to write a small benchmark yourself and see what happens when you run multiple programs on the same GPU.
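
Something along these lines would do as a starting point (a rough sketch; the kernel and launch count are arbitrary): time a fixed number of kernel launches with CUDA events, run one copy of the program by itself, then run two copies side by side and compare the per-process timings.

```
// A rough sketch of such a benchmark: time N kernel launches with CUDA
// events. Run one copy of this program, then two copies at once, and
// compare -- per-process time should grow when contexts share the GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = x[i] * 1.0001f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int launches = 1000;   // arbitrary choice
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        work<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches took %.1f ms (%.3f ms per launch)\n",
           launches, ms, ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```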

As for the abort, I wasn’t being very precise. What will actually happen is that cudaMalloc will return an error, and then you have to decide how to handle the out-of-device-memory condition. In my applications, if there is no device memory, the program can’t run, so I abort.
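
In code, that decision point looks roughly like this (a minimal sketch of the pattern, with a made-up allocation size, not anything specific to my applications):

```
// A minimal sketch of handling the out-of-device-memory condition:
// cudaMalloc reports failure through its return code, and the caller
// decides whether to abort, retry later, or fall back to something else.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    size_t wanted = 1UL << 30;   // hypothetical 1 GB request
    float *d_buf = NULL;

    cudaError_t err = cudaMalloc((void **)&d_buf, wanted);
    if (err == cudaErrorMemoryAllocation) {
        // Another context may already hold most of the device memory.
        fprintf(stderr, "Out of device memory: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);   // or: wait and retry, use a smaller buffer, ...
    } else if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // ... kernels would use d_buf here ...
    cudaFree(d_buf);
    return 0;
}
```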