Recently I was testing the MPS feature of CUDA. I wanted to compare the output of my code with MPS on and off. The kernel and the host-side timing code are:
__global__ void delay_kernel(unsigned seconds){
  // Busy-wait until roughly `seconds` worth of GPU clock ticks have elapsed.
  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}
unsigned long long difft = dtime_usec(0);
delay_kernel<<<1,1>>>(delay_t);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
difft = dtime_usec(difft);
printf("kernel duration: %fs\n", difft/(float)USECPSEC);
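The helpers dtime_usec and USECPSEC are not shown in the snippet above. A minimal sketch of what they might look like, assuming dtime_usec returns the current time in microseconds minus its argument and USECPSEC is the number of microseconds per second (the std::chrono implementation is an assumption for illustration, not necessarily what the original code used; cudaCheckErrors and APPRX_CLKS_PER_SEC are left out):

```cpp
#include <chrono>

// Assumed value: microseconds per second, used to convert the result to seconds.
#define USECPSEC 1000000ULL

// Hypothetical reconstruction of the timing helper.
// Returns the current monotonic time in microseconds minus `start`, so
// dtime_usec(0) yields a timestamp and dtime_usec(t0) yields the elapsed time.
unsigned long long dtime_usec(unsigned long long start) {
    using namespace std::chrono;
    unsigned long long now = static_cast<unsigned long long>(
        duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count());
    return now - start;
}
```

With this definition, the measured duration covers the kernel launch plus the cudaDeviceSynchronize() call, so it is host wall-clock time rather than time measured on the GPU.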
I launched 5 processes running this code and got the following output with MPS off:
kernel duration: 4,82131
kernel duration: 4,84256
kernel duration: 4,85792
kernel duration: 4,86543
kernel duration: 4,87329
Then I ran the code with MPS on:
kernel duration: 4,75438
kernel duration: 4,74632
kernel duration: 4,72964
kernel duration: 4,78427
kernel duration: 4,75322
I wonder why the kernels appear to execute concurrently even with MPS off. Shouldn’t kernels from different contexts be executed sequentially?