Hi,
So I have a couple of kernels that perform operations on arrays whose sizes vary depending on the input.
Both kernels are single-threaded. I am running these experiments on a 1080 Ti.
One kernel (say A) is a bit longer and takes ~5 sec to complete.
The other (say B) is comparatively shorter and takes ~3 sec to complete.
The above timings are measured on the host side using the "gettimeofday" function.
Referring to this: gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow
From that answer I see that kernels are time-sliced. However, when I run the two kernels together without MPS,
I see the same timings as when I run them individually.
When I run them with MPS, I again see the same timings as when I run them individually.
My question is: why is the behavior without MPS the same as with MPS? Here the 5 and 3 sec are compute time needed by the kernels themselves, unlike the experiments in the Stack Overflow post above. To record the exact launch times I again use a host-side (CPU) function:
Kernel A:
Start Time: 1560949000340165
End Time: 1560949006120578
Kernel B:
Start Time: 1560949000350104
End Time: 1560949003426926
Timing is recorded on the host as follows (same pattern for both kernels):
<ret_val> sec_func (arg1, arg2)
- launch kernel
- return value from kernel
my_func ()
- d_time = gettimeofday ()
- val = sec_func ()
- diff_t = gettimeofday () - d_time
- print diff_t
I am expecting the jobs to be scheduled in a round-robin fashion, so at least one kernel should take longer when both are launched together, but that is not the case.
Is my understanding correct, or is there a flaw in the way I am recording the timing?
Any help is appreciated.
Thanks