cuda kernels from different process can run concurrently? same performance with MPS on and off?

lewisreg · April 28, 2018, 7:53am

Recently I was testing the mps feature of cuda.I intended to compare the outputs of my code with mps on and off.

__global__ void delay_kernel(unsigned seconds){

  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}

unsigned long long difft = dtime_usec(0);
  delay_kernel<<<1,1>>>(delay_t);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  difft = dtime_usec(difft);
  printf("kernel duration: %fs\n", difft/(float)USECPSEC);

I launched 5 processes to run this code, and got the following output with MPS off
kernel duration: 4,82131
kernel duration: 4,84256
kernel duration: 4,85792
kernel duration: 4,86543
kernel duration: 4,87329

Then i run the code with MPS on:
kernel duration: 4,75438
kernel duration: 4,74632
kernel duration: 4,72964
kernel duration: 4,78427
kernel duration: 4,75322

I wonder why the kernels can be executed concurrently with MPS off. Shouldn’t the kernels from different context be executed sequentially?

lewisreg · April 28, 2018, 8:47am

any comments?

Robert_Crovella · April 28, 2018, 3:00pm

Your output doesn’t conclusively indicate that the kernels are executing concurrently (in either case). If the processes fully serialize, you would get this output. Would need a lot more info. And maybe you haven’t shut off MPS correctly.

Based on the naming of your variables and functions, it looks like you’ve already looked at the article here:

[url]gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow

or here:

[url]https://devtalk.nvidia.com/default/topic/887822/fail-to-launch-cuda-mps/[/url]

Would need approximately that level of information to discover what is happening. To confirm kernel concurrency, you could use the nvprof profile all processes capability:

[url]Profiler :: CUDA Toolkit Documentation

lewisreg · May 2, 2018, 9:36am

hi txbob
I solved my problem. Because I use time limitation to limit my kernel’s running time, with time-sliced scheduler, all my kernels will reach the time limitation at almost the same time, therefore they will all last 5 seconds.

But if i change the kernels to run a certain number of loops(make sure the loop will last 5 seconds), then 5 kernels will all last 25 seconds, because of time-sliced scheduler. If I turned on mps, all kernels will run concurrently, and will all stop running after 5 seconds.

Robert_Crovella · May 2, 2018, 11:15am

So you’re running this in a virtual graphics setup (i.e. GRID) ?

lewisreg · May 2, 2018, 12:16pm

No, I tested my code on my personal gtx 1080. I found that when gpu is set to DEFAULT compute mode, different processes can access gpu at the same time with a time-sliced scheduler. Therefore, I cannot use clock() function to limit my kernels running time in this case.

Robert_Crovella · May 2, 2018, 12:19pm

.

lewisreg · May 2, 2018, 12:28pm

why?
In my previous code, I use a empty while loop, with condition:

clock()-start_time<delay

then 5 kernels will stop at the same time after approximately “delay” seconds, on matter the MPS is turned on or off.

But if I change the loop condition to:

(loop_count<MAX_COUNT)

In this case, with MPS turned off, 5 kernels will all last “KERNEL_TIME * 5” seconds. But with MPS turned on, all kernels will only last “KERNEL_TIME” seconds.

So that’s why I think the time-sliced scheduler is the reason.

Robert_Crovella · May 2, 2018, 3:06pm

I stand corrected.

Yes, I would agree with you here, at least under CUDA 9.1 the behavior seems to have changed on Pascal and Volta. If you run your test case on a pre-Pascal GPU, I think you’ll still find behavior similar to what is reported here:

[url]gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow

specifically referring to the different execution times for one instance vs. the other in the non-MPS case.

However if you run that same test case on a Pascal (or Volta) GPU, the kernels are running “concurrently”. To be more specific:

on pre-pascal GPUs, the scheduler (context switcher) is only switching from one GPU context to another at kernel execution boundaries
on pascal or newer GPUs, the scheduler is switching during kernel execution (ie. time-slicing, also referred to as preemption).

Sorry for the misinformation (I’ve removed some of my comments above) and thanks for pointing this out.

lewisreg · May 3, 2018, 2:03am

thanks for your help~ I understand the underlying scheduler much more better

Topic		Replies	Views
Parallelization of kernels without MPS CUDA Programming and Performance	6	745	February 5, 2019
Question about GPU sharing of Multi-process service CUDA Programming and Performance	9	6526	April 30, 2018
MULTI-PROCESS SERVICE(MPS) has no effect CUDA Programming and Performance	3	808	October 16, 2018
Understanding how threads are scheduled with and without MPS CUDA Programming and Performance	0	367	June 19, 2019
Oscilating performance, Code total times variates CUDA Programming and Performance	10	10572	June 21, 2009
Concurrent kernel execution without stream CUDA Programming and Performance	7	2455	December 28, 2016
How to Launch Cuda kernel in different processes CUDA Programming and Performance	8	3736	November 6, 2018
Can kernel function parallel with CPU code? CUDA Programming and Performance	12	7737	December 5, 2008
Why is the Kernel faster when my matrices are not initialized CUDA Programming and Performance	2	738	December 18, 2017
How is the laptop GPU able to do the rendering and execute a cuda program at the same time CUDA Programming and Performance	6	750	August 15, 2023

cuda kernels from different process can run concurrently? same performance with MPS on and off?

Related topics