Concurrent kernel execution without stream

gakky1667 · December 20, 2016, 2:37am

Hello,

I’m trying to observe concurrent kernel execution in nvvp. My setup is as follows:

Ubuntu 14.04 + CUDA 7.0 + Nvidia Driver [375.20] + GTX780

I launched five tasks with the following commands, because I wanted to see what happens when multiple kernels run at the same time.

$ mpirun -np 5 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./sumArraysOnGPU-timer
$ nvvp
# File > Import > *.nvprof

All of these tasks run the same code, taken from CUDA_examples/sumArraysOnGPU-timer.cu at master · welcheb/CUDA_examples · GitHub; only the process IDs differ.

How does the GPU handle multiple kernels?
Does the GPU give each kernel exclusive access?
If the GPU can run multiple kernels simultaneously without streams, does it schedule them round-robin?

I ask because I found that the kernel execution time varies much more when multiple kernels are launched on a single GPU.

I also visualized the kernels with nvcc.
In that view, the launched kernels were not processed one by one but appeared to run at the same time.

Robert_Crovella · December 20, 2016, 5:27am

What you are launching is multiple processes.

If kernels emanate from separate processes, they cannot run concurrently unless CUDA MPS is used.
If kernels emanate from the same process, they cannot run concurrently unless they are launched into separate (non-null) streams.

Once you have satisfied the above requirements, there are other “requirements” to actually witness concurrent kernel execution.
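
As an illustration of the same-process case, here is a minimal sketch that launches one kernel into each of two separate non-null streams. The kernel `spinKernel`, the helper `launchInTwoStreams`, and the cycle count are hypothetical names and values chosen for this example; they do not come from the code discussed in this thread. Whether the two kernels actually overlap on the timeline still depends on the other requirements mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel (not from the thread): spins for about `cycles`
// GPU clock ticks so that any overlap is easy to see on a profiler timeline.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy wait */ }
}

// Launches the same kernel into two separate non-null streams of one process.
// Returns cudaSuccess if stream creation and the final synchronization succeed.
cudaError_t launchInTwoStreams(long long cycles) {
    cudaStream_t s1, s2;
    cudaError_t err = cudaStreamCreate(&s1);
    if (err != cudaSuccess) return err;
    err = cudaStreamCreate(&s2);
    if (err != cudaSuccess) { cudaStreamDestroy(s1); return err; }

    // The fourth launch parameter selects the stream; omitting it uses the null stream.
    spinKernel<<<1, 1, 0, s1>>>(cycles);
    spinKernel<<<1, 1, 0, s2>>>(cycles);

    err = cudaDeviceSynchronize();  // wait for both kernels to finish

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return err;
}

int main() {
    cudaError_t err = launchInTwoStreams(100000000LL);  // arbitrary spin length
    printf("result: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Built with something like `nvcc two_streams.cu -o two_streams` (the file name is arbitrary) and run under nvprof, the two launches show up on separate stream rows in nvvp, which makes it straightforward to check whether they overlap.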

You can read more about it in the asynchronous concurrent execution section of the programming guide:

Programming Guide :: CUDA Toolkit Documentation

and also by studying the CUDA concurrent kernels sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#concurrent-kernels

To witness concurrent kernels from separate processes, you may wish to read this:

gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow

gakky1667 · December 21, 2016, 6:30am

Hi txbob,

Thank you for the detailed explanation.
I understand now that kernels from separate processes do not run concurrently.

But there are two things I still do not understand.

I got the following profiling result from NVCC.
https://postimg.org/image/nny48ev3v/
The kernels appear to be running at the same time.

I also measured the kernel execution time.

/* Helper used below:
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
} */
unsigned long long difft = dtime_usec(0);            // start timestamp in microseconds
sumArraysOnGPU<<<grid,block>>>(d_A,d_B,d_C,nElem);
cudaDeviceSynchronize();                             // block until the kernel has finished
difft = dtime_usec(difft);                           // elapsed microseconds

The measured times almost match the NVCC result.
From your comment, I learned that kernels from separate processes actually run one by one.
Is the NVCC result incorrect?
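
For anyone who wants to reproduce this measurement, here is a self-contained sketch of the timing pattern shown above. Only `dtime_usec` and the launch/synchronize sequence come from the snippet; the value of `USECPSEC`, the kernel body, the array size, the block size, the warm-up launch, and the loop of five runs are assumptions made for this example.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

#define USECPSEC 1000000ULL  // assumption: microseconds per second

// Returns microseconds elapsed since `start` (pass 0 to get the current time).
unsigned long long dtime_usec(unsigned long long start) {
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec * USECPSEC) + tv.tv_usec) - start;
}

// Assumed kernel body: element-wise sum of two arrays.
__global__ void sumArraysOnGPU(const float *A, const float *B, float *C, const int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    const int nElem = 1 << 24;  // assumed problem size
    const size_t nBytes = nElem * sizeof(float);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    if (cudaMalloc(&d_A, nBytes) != cudaSuccess ||
        cudaMalloc(&d_B, nBytes) != cudaSuccess ||
        cudaMalloc(&d_C, nBytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_A, 0, nBytes);
    cudaMemset(d_B, 0, nBytes);

    dim3 block(256);  // assumed block size
    dim3 grid((nElem + block.x - 1) / block.x);

    // Warm-up launch so one-time startup cost is not included in the timings.
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
    cudaDeviceSynchronize();

    for (int run = 0; run < 5; ++run) {
        unsigned long long difft = dtime_usec(0);
        sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
        cudaDeviceSynchronize();  // host timer stops only after the kernel has completed
        difft = dtime_usec(difft);
        printf("run %d: %llu us\n", run, difft);
    }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
```

Because the host timer brackets both the launch and `cudaDeviceSynchronize()`, the reported interval includes any time the kernel spends waiting before it starts, not only the time it spends executing.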

I also examined the context switches of the processes using the GPU with trace-cmd and kernelshark.

sudo trace-cmd record -e sched_switch
kernelshark

https://postimg.org/image/r8tzrmznf/

To make the execution of each kernel easier to see, I inserted usleep() after each kernel.
One kernel has a short execution time and is repeated 5 times.
The other has a long execution time and runs once.

I can see that the short kernel runs without waiting for the other kernel to finish.
So, what is going on?

Robert_Crovella · December 21, 2016, 2:32pm

I’m not sure what you are confused about. The fact that a kernel is taking twice as long to execute as it normally should in this case is indicative that it is waiting for something else to complete. That “something else” is a kernel launched from another process.

If you are asking why doesn’t nvvp (not NVCC as you said several times) clearly delineate that the waiting kernel is waiting, not executing, I’m not sure the reason for that. The difference may be invisible when viewed from the standpoint of a single process. Nevertheless I think the underlying behavior (kernels from separate processes do not run concurrently) is pretty clear.

gakky1667 · December 22, 2016, 5:57am

Hi,

So that means a kernel must wait for a kernel launched from another process to complete, even though it has already finished its own processing.
Why does a kernel wait for another kernel to complete?

Robert_Crovella · December 22, 2016, 2:47pm

No, it’s not waiting for another kernel to complete after it has already finished processing. It is waiting for another kernel to complete before it can start processing.

Fundamentally, the kernel is waiting for another kernel, because kernels from separate processes will not execute concurrently. They will serialize (unless you use CUDA MPS; even then, you must meet various requirements for concurrent kernel execution).

gakky1667 · December 28, 2016, 6:30am

If the kernels serialize, then in case B of the following figure, shouldn't one of the two kernels take twice as long as normal while the other takes its normal time?

https://postimg.org/image/nny48ev3v/

Why are both execution times doubled?
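
The expectation behind this question can be written out. This is only a sketch under an idealized model: strict back-to-back serialization, both kernels submitted at time 0, and each kernel needing time $T$ when run alone. The symbols $T$, $t_1$ and $t_2$ are introduced here for illustration and do not appear elsewhere in the thread.

```latex
\begin{align*}
t_1 &= 0 + T = T  && \text{(first kernel: no wait, then runs for } T\text{)} \\
t_2 &= T + T = 2T && \text{(second kernel: waits } T\text{, then runs for } T\text{)}
\end{align*}
```

Under that model only the second measurement doubles, which is why two measurements that are both close to $2T$ look surprising.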

Robert_Crovella · December 28, 2016, 2:55pm

I don’t know. It may be that in the context-switching scenario that is involved here, the signalling of the completion of the kernel is delayed due to context switching. But that is just a guess. It may also be an artifact of the profiler, but I would discount that idea based on timing measurement.
