Concurrent kernel execution without streams


I’m trying to observe concurrent kernel execution in nvvp. My setup is the following:

Ubuntu 14.04 + CUDA 7.0 + NVIDIA driver 375.20 + GTX 780

I ran 5 tasks with the following commands because I wanted to verify the behavior when multiple kernels run at the same time.

$ mpirun -np 5 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./sumArraysOnGPU-timer
// file > import > *.nvprof

These tasks run the same code, but each has a different process ID.

I want to know how the GPU handles multiple kernels.
Does the GPU perform exclusive control over kernels?
If the GPU can run multiple kernels simultaneously without using streams, does it schedule them round-robin?

I ask because I observed that the variation in kernel execution time became large when multiple kernels were launched on a single GPU.

I also visualized the kernels with NVCC.
As a result, the launched kernels were not processed one by one but ran at the same time.

What you are launching is multiple processes.

If kernels emanate from separate processes, they cannot run concurrently unless CUDA MPS is used.
If kernels emanate from the same process, they cannot run concurrently unless they are launched into separate (non-null) streams.
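To illustrate the second case, here is a minimal sketch of launching two kernels into separate non-null streams within a single process. The kernel `myKernel`, the array sizes, and the launch configuration are all hypothetical; even with separate streams, actual overlap still depends on the additional requirements mentioned below.

```cuda
#include <cstdio>

// Hypothetical trivial kernel, used only for illustration.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));

    // Create two non-null (non-default) streams.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launching into separate non-null streams makes these kernels
    // *eligible* to run concurrently; whether they actually overlap
    // also depends on resource availability (SMs, registers, etc.).
    myKernel<<<(N + 255) / 256, 256, 0, s1>>>(a, N);
    myKernel<<<(N + 255) / 256, 256, 0, s2>>>(b, N);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If both kernels were instead launched with no stream argument, they would go into the default (null) stream and serialize with respect to each other.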

Once you have satisfied the above requirements, there are other “requirements” to actually witness concurrent kernel execution.

You can read more about it in the asynchronous concurrent execution section of the programming guide:

and also by studying the CUDA concurrent kernels sample code:

To witness concurrent kernels from separate processes, you may wish to read this:

Hi txbob,

Thank you for the detailed explanation.
I understood that kernels from separate processes do not run concurrently.

But there are two things I do not understand yet.

  1. I got the following profiling result from NVCC.
    It seemed that the kernels were running at the same time.

I also measured the kernels’ execution time.

#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  struct timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

unsigned long long difft = dtime_usec(0);
difft = dtime_usec(difft);

The results almost correspond with the NVCC result.
From your comment I learned that kernels from separate processes actually run one by one.
Is the NVCC result incorrect?

  2. I examined context switches of the processes using the GPU with trace-cmd and kernelshark.
sudo trace-cmd record -e sched_switch

To make the execution of each kernel easier to see, a usleep() is inserted after each kernel launch.
One kernel has a short execution time and is repeated 5 times.
The other has a long execution time and runs once.

I can see that the short kernel runs without waiting for the other kernel’s processing to complete.
So, what is going on?

I’m not sure what you are confused about. The fact that a kernel is taking twice as long to execute as it normally should in this case is indicative that it is waiting for something else to complete. That “something else” is a kernel launched from another process.

If you are asking why nvvp (not NVCC, as you said several times) doesn’t clearly delineate that the waiting kernel is waiting rather than executing, I’m not sure of the reason for that. The difference may be invisible when viewed from the standpoint of a single process. Nevertheless, I think the underlying behavior (kernels from separate processes do not run concurrently) is pretty clear.


Do you mean that a kernel must wait for another kernel launched from another process to complete, even though the kernel itself has finished its processing?
Why does a kernel wait for another kernel to complete?

No, it’s not waiting for another kernel to complete after it has already finished processing. It is waiting for another kernel to complete before it can start processing.

Fundamentally, the kernel is waiting for another kernel, because kernels from separate processes will not execute concurrently. They will serialize (unless you use CUDA MPS; even then, you must meet various requirements for concurrent kernel execution).
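For completeness, here is a sketch of how CUDA MPS might be enabled for the mpirun case above. This assumes root access and a GPU/driver combination that actually supports MPS (the MPS documentation restricts which GPUs qualify); the daemon commands are the standard ones from that documentation.

```shell
# Start the MPS control daemon (it spawns the MPS server on first use).
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Run the MPI job as before; the ranks now share one GPU context via MPS,
# so their kernels become eligible to run concurrently.
mpirun -np 5 ./sumArraysOnGPU-timer

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control
```

Even under MPS, the usual intra-process conditions (separate non-null streams, sufficient SM resources) still apply before concurrency is actually observed.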

If the kernels serialize, as in case B in the following figure, shouldn’t one of the two kernels take twice as long to execute while the other takes its normal time?

Why are both execution times doubled?

I don’t know. It may be that in the context-switching scenario involved here, the signalling of the completion of the kernel is delayed by the context switching. But that is just a guess. It may also be an artifact of the profiler, but I would discount that idea based on your timing measurements.