Dear forum users and NVIDIA staff,
I recently got access to a Tesla K20C and, inspired by the ‘Inside Kepler’ presentation from GTC 2012, decided to try out the improved grid management unit.
According to slides 30 to 40 of the presentation, GK110-based GPUs are capable of executing kernels from different processes in parallel if enough resources are available.
I wrote a simple kernel that performs 200 million additions for each output item and ran it on a very small grid of 1024 threads with a block size of 256 (4 blocks). A single launch takes about 3-4 seconds. When I run two instances of this kernel in two streams of the same context, the hardware overlaps their execution and the whole application still takes ~4 seconds. However, when I launch the kernel in two different contexts, the calls are serialized and the total execution time doubles. The same happens if I start two processes, each of which launches a single instance of the kernel.
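A minimal sketch of the same-context, two-stream test described above might look like the following. It is a reconstruction under assumptions, not the original code: the names busyAdd, timeLaunches and CUDA_CHECK are hypothetical, the additions are assumed to be single-precision, and wall-clock timing on the host is just one way to measure the runs.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the kernel in the post: every thread performs
// `iters` dependent additions and writes one output item. `step` is a
// runtime argument so the loop cannot be folded at compile time.
__global__ void busyAdd(float *out, long long iters, float step)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (long long i = 0; i < iters; ++i)
        acc += step;
    out[idx] = acc;
}

// Launch `launches` (1 or 2) instances of the kernel, each in its own
// stream of the current context, and return the wall-clock time in ms.
double timeLaunches(int launches, long long iters)
{
    const int threads = 1024;  // total threads per launch, as in the post
    const int block = 256;     // block size, as in the post
    float *out[2] = {nullptr, nullptr};
    cudaStream_t stream[2];

    for (int i = 0; i < launches; ++i) {
        CUDA_CHECK(cudaMalloc(&out[i], threads * sizeof(float)));
        CUDA_CHECK(cudaStreamCreate(&stream[i]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());  // keep setup out of the timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        busyAdd<<<threads / block, block, 0, stream[i]>>>(out[i], iters, 1.0f);
        CUDA_CHECK(cudaGetLastError());
    }
    CUDA_CHECK(cudaDeviceSynchronize());  // wait for all streams to finish
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < launches; ++i) {
        CUDA_CHECK(cudaStreamDestroy(stream[i]));
        CUDA_CHECK(cudaFree(out[i]));
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const long long iters = 200000000LL;  // 200 million additions per item
    CUDA_CHECK(cudaFree(0));              // force context creation up front
    std::printf("1 launch:   %.1f ms\n", timeLaunches(1, iters));
    std::printf("2 launches: %.1f ms\n", timeLaunches(2, iters));
    return 0;
}
```

If the two launches overlap, the second figure should be close to the first; if they are serialized, it should be roughly double. Running two copies of this program at the same time with a single launch each would correspond to the two-process case.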
Could you please clarify whether overlapping execution of kernels from different contexts/processes is actually supported on GK110? If the answer is ‘yes’, then what am I doing wrong? If the answer is ‘no’, then did I misunderstand the presentation, or did NVIDIA fail to deliver what was planned?