Understanding how threads are scheduled with and without MPS


So I have a couple of kernels that perform operations on arrays whose sizes vary depending on the input.
They are both single-threaded. I am running these experiments on a 1080 Ti.

One kernel (say A) is a bit longer and takes ~5 s to complete.
The other (say B) is comparatively shorter and takes ~3 s to complete.

The above timings are measured on the host side using gettimeofday().

Referring to this Stack Overflow question: "How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?"

I see that kernels are time-sliced. Given that, when I run the above two kernels concurrently without MPS,
I see the same results as when I run them individually.
When I run them with MPS, I also see the same results as when I run them individually.

My question is: why is running without MPS the same as running with MPS? Here the 5 s and 3 s are compute time needed to run the kernels, unlike the experiments mentioned in the Stack Overflow post above. To record the exact launch times I again use a host-side (CPU) function:

Kernel A:
Start Time: 1560949000340165
End Time: 1560949006120578

Kernel B:
Start Time: 1560949000350104
End Time: 1560949003426926
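
The raw timestamps above are in microseconds, so the elapsed times and the launch overlap can be checked directly (a quick sanity check in Python):

```python
# Timestamps from the two runs, in microseconds since the epoch
a_start, a_end = 1560949000340165, 1560949006120578
b_start, b_end = 1560949000350104, 1560949003426926

# Elapsed wall-clock time per kernel, in seconds
a_elapsed = (a_end - a_start) / 1e6   # ~5.78 s
b_elapsed = (b_end - b_start) / 1e6   # ~3.08 s

# Gap between the two launches, in seconds
launch_gap = (b_start - a_start) / 1e6   # ~0.01 s

print(a_elapsed, b_elapsed, launch_gap)
```

So the two kernels launch within about 10 ms of each other and overlap for essentially B's whole lifetime, yet each still shows roughly its standalone runtime.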

Time is recorded on the host as follows (same for both kernels):

<ret_val> sec_func (arg1, arg2)

  1. launch_kernel
  2. return val_from_kernel

my_func ()

  1. d_time = gettimeofday ()
  2. val = sec_func ()
  3. diff_t = gettimeofday () - d_time
  4. print diff_t

I was expecting the jobs to be scheduled in a round-robin fashion, so at least one kernel should take longer when both are launched together, but that is not the case.
Is my understanding correct, or is there a flaw in the way I am recording the timings?
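
For comparison, here is what an idealized equal-share time-slicing model would predict for two kernels with 5 s and 3 s of compute launched together (a back-of-the-envelope sketch, not a claim about the actual 1080 Ti scheduler):

```python
def equal_share_finish_times(work):
    """Finish times if every still-running kernel gets an equal GPU share."""
    remaining = dict(work)   # seconds of compute left per kernel
    t = 0.0
    finish = {}
    while remaining:
        n = len(remaining)
        # The kernel with the least remaining work completes next;
        # with n kernels sharing, its remaining work takes n times longer.
        k = min(remaining, key=remaining.get)
        done = remaining[k]
        t += done * n
        finish[k] = t
        del remaining[k]
        for other in remaining:
            remaining[other] -= done
    return finish

print(equal_share_finish_times({"A": 5.0, "B": 3.0}))
```

Under that model B would finish at ~6 s and A at ~8 s, whereas the timestamps show essentially the standalone ~3.1 s and ~5.8 s, which is what prompts the question.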

Any help is appreciated.