Parallel execution of kernels from different contexts on K20C

Dear forum users and NVIDIA staff,

I recently got access to a Tesla K20C and, inspired by the ‘Inside Kepler’ presentation from GTC 2012, decided to try out the improved Grid Management Unit.

According to slides 30 to 40 of the presentation, GK110-based GPUs are capable of executing kernels from different processes in parallel if there are enough resources.

I wrote a simple kernel that performs 200 million additions for each output item and executed it on a very small grid of 1024 threads with a block size of 256. A single run takes about 3-4 seconds. When I execute two instances of this kernel in two streams of the same context, the hardware overlaps their execution and the whole application still takes ~4 seconds. However, when I launch the kernel in two different contexts, the calls are serialized and the total execution time doubles. The same happens if I launch two processes, each of which launches a single instance of the kernel.
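
For reference, here is a minimal sketch of the kind of test I am running (reconstructed for this post; the kernel body and names are illustrative, and error checking is omitted for brevity):

[code]
#include <cuda_runtime.h>

// Long-running dummy kernel: ~200 million dependent additions per output
// item, enough to keep a tiny 1024-thread grid busy for a few seconds.
// extern "C" keeps the name unmangled, which is handy when the same kernel
// is loaded through the driver API.
extern "C" __global__ void longAdd(float *out, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += 1.0f + idx * 1e-7f;  // data-dependent so it is not folded away
    out[idx] = acc;
}

int main()
{
    const int nThreads = 1024, blockSize = 256;
    const int iters = 200 * 1000 * 1000;
    float *dOut;
    cudaMalloc(&dOut, nThreads * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Two instances in two streams of the same context: on the K20C these
    // overlap, so the whole program still finishes in ~4 seconds.
    longAdd<<<nThreads / blockSize, blockSize, 0, s0>>>(dOut, iters);
    longAdd<<<nThreads / blockSize, blockSize, 0, s1>>>(dOut, iters);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dOut);
    return 0;
}
[/code]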

Could you please clarify whether overlapped execution of kernels from different contexts/processes is actually supported on GK110? If the answer is ‘yes’, what am I doing wrong? If the answer is ‘no’, did I misunderstand the presentation, or did NVIDIA fail to deliver what was planned?

Thank you

I might be wrong, but I think what you want is the simpleHyperQ example in the CUDA 5.0 SDK.

From the CUDA Samples Browser:

“This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices which provide HyperQ (SM 3.5). Devices without HyperQ (SM 2.0 and SM 3.0) will run a maximum of two kernels concurrently.”

Edit: Judging from your first paragraph, you might already be doing that. I’m not sure about the answers to the questions in your second paragraph.

The simpleHyperQ example launches kernels into streams within a single context. That works fine both in the sample and in my test. My question, however, is about launching kernels from different contexts, and those do not seem to overlap.
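
To make the failing case concrete, here is a driver-API sketch of what I mean by ‘different contexts’ (illustrative only: longadd.cubin is assumed to hold the longAdd kernel from my first post, compiled with something like nvcc -arch=sm_35 -cubin, and error checking is again omitted):

[code]
#include <cuda.h>

// Two contexts on the same device, one kernel launch in each.
// On my K20C these launches serialize instead of overlapping.
int main()
{
    CUdevice dev;
    CUcontext ctxA, ctxB;
    CUmodule modA, modB;
    CUfunction fnA, fnB;
    CUdeviceptr outA, outB;
    int iters = 200 * 1000 * 1000;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctxA, 0, dev);
    cuCtxCreate(&ctxB, 0, dev);

    // Each context needs its own module image and its own allocation.
    cuCtxSetCurrent(ctxA);
    cuModuleLoad(&modA, "longadd.cubin");
    cuModuleGetFunction(&fnA, modA, "longAdd");
    cuMemAlloc(&outA, 1024 * sizeof(float));

    cuCtxSetCurrent(ctxB);
    cuModuleLoad(&modB, "longadd.cubin");
    cuModuleGetFunction(&fnB, modB, "longAdd");
    cuMemAlloc(&outB, 1024 * sizeof(float));

    // Launches are asynchronous, so both kernels are in flight before
    // either synchronize call; the question is whether the GPU overlaps them.
    void *argsA[] = { &outA, &iters };
    void *argsB[] = { &outB, &iters };
    cuCtxSetCurrent(ctxA);
    cuLaunchKernel(fnA, 4, 1, 1, 256, 1, 1, 0, NULL, argsA, NULL);
    cuCtxSetCurrent(ctxB);
    cuLaunchKernel(fnB, 4, 1, 1, 256, 1, 1, 0, NULL, argsB, NULL);

    cuCtxSetCurrent(ctxA);
    cuCtxSynchronize();
    cuCtxSetCurrent(ctxB);
    cuCtxSynchronize();
    return 0;
}
[/code]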

Just got confirmation from NVIDIA that current hardware (including the GK110-based chips) will not overlap execution of kernels from different driver-API CUDA contexts, neither in CUDA 5.0 nor in the upcoming 5.5 release:

[i]With CUDA 5.0, the Kepler hardware (Tesla K20, GTX Titan) will not overlap kernels issued from different processes (see note below on processes vs. CUDA contexts).

With CUDA 5.5, the hardware will overlap execution when possible, if the system administrator runs a multi-process server on the node that hosts the GPU (the exact details will be included in the CUDA 5.5 documentation). This will be supported on Linux only.

On processes vs. contexts: if you are using the Runtime API, you can replace “process” above with “CUDA context” and I think I answered your exact question. However, if you are using the driver API and you are issuing kernels to the same GPU from multiple CUDA contexts within the same process, then even in CUDA 5.5 the hardware will not overlap kernels launched from those different CUDA contexts with each other.[/i]
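
For completeness, the two-process variant of my test is just the same kernel launched from two forked processes; each child touches CUDA only after the fork, so each one gets its own process and its own context (sketch, error checking omitted):

[code]
#include <unistd.h>
#include <sys/wait.h>
#include <cuda_runtime.h>

// Same illustrative kernel as in my first post, repeated here so this
// file is self-contained.
extern "C" __global__ void longAdd(float *out, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += 1.0f + idx * 1e-7f;
    out[idx] = acc;
}

static void runOnce()
{
    float *dOut;
    cudaMalloc(&dOut, 1024 * sizeof(float));
    longAdd<<<4, 256>>>(dOut, 200 * 1000 * 1000);
    cudaDeviceSynchronize();
    cudaFree(dOut);
}

int main()
{
    // Fork BEFORE any CUDA call so each child creates its own context in
    // its own process. On CUDA 5.0 these two kernels serialize on the K20C.
    for (int i = 0; i < 2; ++i) {
        if (fork() == 0) {
            runOnce();
            _exit(0);
        }
    }
    while (wait(NULL) > 0) { }
    return 0;
}
[/code]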

I’ve seen discussion of nvidia-cuda-proxy-server (and nvidia-cuda-proxy-control), which I believe is the program you use to allow multiple processes to share a CUDA context. It already ships with CUDA 5.0, so I don’t know how its current behaviour compares with what CUDA 5.5 will offer.

I too was confused by the description of HyperQ at first, so you are not alone.