I decided to run a simple experiment on a Linux laptop (Fedora 25) with a Quadro K610M GPU and CUDA 8.0.
I created the following code, which launches a simple delay kernel once every 100 ms, for a total of 100 seconds. The initial kernel delay is set at 10 ms, and after every 10 seconds the kernel delay is increased by another 10 ms. So initially we are launching a kernel once every 0.1 s with a kernel duration of 0.01 s; by the end, we are launching a kernel once every 0.1 s with a kernel duration of 0.1 s.
Here is the code:
$ cat t15.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#define USECPSEC 1000000ULL

// host-side timer: returns elapsed microseconds since 'start'
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// spin for (approximately) the given number of GPU clock cycles
__global__ void delay_kernel(unsigned long long us){
  unsigned long long dt = clock64();
  while (clock64() < (dt + us));
}

int main(){
  // calibrate: measure how many clock cycles correspond to one second
  delay_kernel<<<1,1>>>(1000);        // warm-up launch
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  delay_kernel<<<1,1>>>(1000000);     // spin for 1M cycles, time it on the host
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  unsigned long long one_second = (1000000ULL/(float)dt)*1000000ULL;
  printf("one second = %llu\n", one_second);
  // sanity-check the calibration
  dt = dtime_usec(0);
  delay_kernel<<<1,1>>>(one_second);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("one second = %f\n", dt/(float)USECPSEC);
  // run kernel loop: each 10-second step launches a kernel every 100 ms,
  // with the kernel duration increasing by ~10 ms per step
  dt = dtime_usec(0);
  int mpy = 10;
  unsigned long long next = mpy*USECPSEC;
  int incr = 0;
  while (incr < 10){
    incr++;
    dt = dtime_usec(0);
    unsigned long long nt = 0;
    printf("%d\n", incr);
    while (nt < next){
      delay_kernel<<<1,1>>>(one_second*0.01*incr);
      usleep(100000);
      nt = dtime_usec(dt);
    }
  }
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t15 t15.cu
$ ./t15
one second = 877963136
one second = 0.920732
1
2
3
4
5
6
7
8
9
10
$
The nvvp profile timeline looks like this (with mpy set to 1 instead of 10, so it runs in 10 seconds instead of 100):
Next, I started glxgears. Then I ran my test app in another terminal window while glxgears was running. Here is the (text) output from glxgears during the ~100 s of my test app's execution:
6905 frames in 5.0 seconds = 1380.915 FPS
5589 frames in 5.0 seconds = 1117.635 FPS
6793 frames in 5.0 seconds = 1358.460 FPS
6883 frames in 5.0 seconds = 1374.402 FPS
6242 frames in 5.0 seconds = 1247.301 FPS
6236 frames in 5.0 seconds = 1243.349 FPS
5488 frames in 5.0 seconds = 1096.961 FPS
5503 frames in 5.0 seconds = 1097.312 FPS
4892 frames in 5.0 seconds = 977.672 FPS
4883 frames in 5.0 seconds = 973.700 FPS
4074 frames in 5.0 seconds = 813.966 FPS
4082 frames in 5.0 seconds = 813.868 FPS
3392 frames in 5.0 seconds = 677.873 FPS
3417 frames in 5.0 seconds = 681.422 FPS
2645 frames in 5.0 seconds = 528.594 FPS
2610 frames in 5.0 seconds = 520.406 FPS
1947 frames in 5.0 seconds = 389.110 FPS
1974 frames in 5.0 seconds = 393.616 FPS
1213 frames in 5.0 seconds = 242.400 FPS
1212 frames in 5.0 seconds = 241.674 FPS
552 frames in 5.0 seconds = 110.305 FPS
659 frames in 5.0 seconds = 131.753 FPS
7316 frames in 5.0 seconds = 1463.083 FPS
So we see that it is fairly easy to roughly predict the performance in this particular case, based on the idea that the GPU is running either CUDA or graphics, but not both. The highest observed framerate, about 1400 FPS, occurred when there was no CUDA activity. The lowest observed framerate, about 110 FPS (less than 10% of the maximum), occurred when the CUDA activity was nearly continuous. Note that the CUDA kernel itself is doing next to nothing: it has almost no resource utilization (one block of one thread, and that thread makes essentially no use of GPU resources such as floating-point units or memory accesses), and yet while it is running it effectively prevents graphics from running.
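As a rough sanity check on that either/or idea, you can predict the framerate at each step as the no-CUDA framerate scaled by the fraction of each 100 ms launch period that the GPU is not occupied by the delay kernel. The sketch below is just a back-of-the-envelope model, not part of the test app; the ~1400 FPS baseline and the 10 ms-per-step duty cycle are taken from the run above, and the exact numbers will of course vary:
#include <stdio.h>
int main(){
  // assumption: graphics gets whatever fraction of the timeline CUDA leaves free
  float max_fps = 1400.0f;               // approximate glxgears rate with no CUDA activity
  for (int incr = 1; incr <= 10; incr++){
    float duty = (0.01f*incr)/0.1f;      // kernel duration / launch period for this step
    if (duty > 1.0f) duty = 1.0f;
    printf("step %2d: CUDA duty cycle ~%3.0f%%, predicted ~%4.0f FPS\n",
           incr, duty*100.0f, max_fps*(1.0f-duty));
  }
}
For example, at the step where the kernel occupies roughly half of each 100 ms period, this predicts roughly half of the baseline framerate, which is in the neighborhood of the mid-run glxgears numbers above. The agreement is only approximate, not least because the calibrated delay came out at about 0.92 of the intended value.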
Note that this is just what I observed in my particular test case, on a fairly old GPU. You may witness something different on your Quadro P4000: the process/context scheduler on Pascal and newer architectures may implement a time-sliced inter-context scheduler rather than a round-robin scheduler, in which case the behavior may look different, especially in the region where the delay kernel would otherwise be filling the timeline. This particular delay method is also susceptible to varying behavior in the presence of a time-sliced scheduler, so I don't know whether you would observe that effect in this case.
Also, perhaps a more interesting test case would be to launch the graphics work and the CUDA work from the same application/process.