cuda kernels from different process can run concurrently? same performance with MPS on and off?

Recently I was testing the mps feature of cuda.I intended to compare the outputs of my code with mps on and off.

__global__ void delay_kernel(unsigned seconds){

  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}
unsigned long long difft = dtime_usec(0);
  delay_kernel<<<1,1>>>(delay_t);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  difft = dtime_usec(difft);
  printf("kernel duration: %fs\n", difft/(float)USECPSEC);

I launched 5 processes to run this code, and got the following output with MPS off
kernel duration: 4,82131
kernel duration: 4,84256
kernel duration: 4,85792
kernel duration: 4,86543
kernel duration: 4,87329

Then i run the code with MPS on:
kernel duration: 4,75438
kernel duration: 4,74632
kernel duration: 4,72964
kernel duration: 4,78427
kernel duration: 4,75322

I wonder why the kernels can be executed concurrently with MPS off. Shouldn’t the kernels from different context be executed sequentially?

any comments?

Your output doesn’t conclusively indicate that the kernels are executing concurrently (in either case). If the processes fully serialize, you would get this output. Would need a lot more info. And maybe you haven’t shut off MPS correctly.

Based on the naming of your variables and functions, it looks like you’ve already looked at the article here:

https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344

or here:

https://devtalk.nvidia.com/default/topic/887822/fail-to-launch-cuda-mps/

Would need approximately that level of information to discover what is happening. To confirm kernel concurrency, you could use the nvprof profile all processes capability:

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#mps-nvprof

hi txbob
I solved my problem. Because I use time limitation to limit my kernel’s running time, with time-sliced scheduler, all my kernels will reach the time limitation at almost the same time, therefore they will all last 5 seconds.

But if i change the kernels to run a certain number of loops(make sure the loop will last 5 seconds), then 5 kernels will all last 25 seconds, because of time-sliced scheduler. If I turned on mps, all kernels will run concurrently, and will all stop running after 5 seconds.

So you’re running this in a virtual graphics setup (i.e. GRID) ?

No, I tested my code on my personal gtx 1080. I found that when gpu is set to DEFAULT compute mode, different processes can access gpu at the same time with a time-sliced scheduler. Therefore, I cannot use clock() function to limit my kernels running time in this case.

.

why?
In my previous code, I use a empty while loop, with condition:

clock()-start_time<delay

then 5 kernels will stop at the same time after approximately “delay” seconds, on matter the MPS is turned on or off.

But if I change the loop condition to:

(loop_count<MAX_COUNT)

In this case, with MPS turned off, 5 kernels will all last “KERNEL_TIME * 5” seconds. But with MPS turned on, all kernels will only last “KERNEL_TIME” seconds.

So that’s why I think the time-sliced scheduler is the reason.

I stand corrected.

Yes, I would agree with you here, at least under CUDA 9.1 the behavior seems to have changed on Pascal and Volta. If you run your test case on a pre-Pascal GPU, I think you’ll still find behavior similar to what is reported here:

https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app/34711344#34711344

specifically referring to the different execution times for one instance vs. the other in the non-MPS case.

However if you run that same test case on a Pascal (or Volta) GPU, the kernels are running “concurrently”. To be more specific:

  • on pre-pascal GPUs, the scheduler (context switcher) is only switching from one GPU context to another at kernel execution boundaries
  • on pascal or newer GPUs, the scheduler is switching during kernel execution (ie. time-slicing, also referred to as preemption).

Sorry for the misinformation (I’ve removed some of my comments above) and thanks for pointing this out.

thanks for your help~ I understand the underlying scheduler much more better