Performance Issue: Using both CUDA for Deep Learning and OpenGL for Rendering

My PC:

  • Xeon 3.x
  • Quadro P4000

My application:

  • Use Yolo in darknet for detection with CUDA enabled: frame time is about 5-15ms/frame. We detect moving objects in real time and project onto them.
  • Use OpenGL to do high-performance rendering: 120FPS. Output goes to a projector for projection mapping.

My Issue:

  • Projection quality is much reduced compared with the old version, which did not use deep learning/CUDA (it used only image processing for detection).

Dear experts,
Do you think CUDA affects OpenGL rendering performance?
A 15ms delay seems too large if CUDA and OpenGL work share one queue.
I would expect NVIDIA to have a parallel design that avoids this issue?
Thank you.

I suppose the OpenGL performance is affected by the CUDA kernel launches within Yolo.
See also https://stackoverflow.com/questions/24149059/running-cuda-and-opengl-in-parallel-without-using-interoperability and https://community.khronos.org/t/do-opengl-and-cuda-opencl-run-in-parallel/67507

5-15ms / frame for Yolo sounds quite fast actually. What version of Yolo (V1/V2/V3) are you using? How many object classes, and what is the receptive field size?

Well,
if so, that is bad news.
NVIDIA should separate the two workloads: even if the command queue is ordered, CUDA and graphics should run asynchronously.

It can be async, I don’t know. But even if CUDA and OpenGL work async, both will compete for the GPU resources (SMs, memory bandwidth, …) and therefore the CUDA performance will affect the OpenGL performance and vice versa.

Thanks for sharing.
I thought this was the official forum for NVIDIA customer care? I expected official help from an NVIDIA expert…

While under an Nvidia domain, Nvidia doesn’t provide official support via this forum. A moderator might clarify this.

I have no experience with Yolo. But if you have source code, you may be able to use cudaStreamCreateWithPriority() to create the Yolo streams with lower priority.

Hello @nguyenhuongw,

This is the official community for Nvidia Developers. This does not guarantee that every forum will get support from an Nvidia employee. There are over 110 forums here within Devtalk, and having staff reply to every one of these forums 24/7 is not possible. We do our best to monitor and cover the majority of the most active forums.

Online communities, in general, rely on the members to offer up their expert advice and experience whenever possible.

Best,
Tom
Devtalk Community Manager

@tera, thanks for your suggestion. I will try that.
@Tom, thanks for your reply, I understand.

However, I would really like confirmation from NVIDIA about the hardware design, so that I can change the hardware design of my system accordingly:

  1. Cuda computing time for 1 frame is 15ms
  2. Rendering time for 1 frame is 1ms

If I run them in parallel on the same graphics card, the maximum rendering performance I can get is:

1000/(15+1) = ~60FPS?

Thank you so much.

If each frame requires a pass through yolo, and yolo requires 15 ms per pass, I don’t see how it could be any different. Not sure what you are expecting.

Having said that, I’m not saying it is guaranteed that you should be able to get 60FPS, for whatever you are doing. I’m saying that given the information you have presented, I agree that the upper bound on performance would be (approximately) 60FPS. That strikes me as pretty obvious, however, and not in any way dependent on any knowledge of NVIDIA hardware design. You should be able to be pretty confident of that conclusion without any input from NVIDIA or anyone else, so I’m not really sure I understand your question.

@Robert
Thanks for your reply.
“That strikes me as pretty obvious, however, and not in any way dependent on any knowledge of NVIDIA hardware design”
I think I forgot to emphasize the keyword “in parallel”.
I mean that I use 2 threads: the 1st for rendering at > 120FPS, the 2nd for Yolo with CUDA, which takes 15ms/frame.
I expected the CUDA and rendering hardware paths to run separately; not totally separately, but maybe 70% or so.
I think that should be possible, because I only utilize ~30% of the GPU when running the Yolo part alone.

Otherwise, if CUDA and rendering are processed serially (sharing the processor), the maximum will be ~60FPS.

They can run separately, but when the GPU is processing CUDA code, it is not processing graphics work, and vice versa. I think this is restating what has already been stated. The graphics pixel pipeline runs through the same hardware (streaming multiprocessors, memory interface, caches, etc) as CUDA kernels do. When those SM computational resources are processing pixels, they are not processing CUDA threads, and vice versa. There is a context switching process on the GPU which flushes one type of activity before control is turned over to the other type of activity, and vice versa.

NVIDIA could certainly design a GPU such that this kind of sharing or serialization doesn’t occur, by providing dedicated resources, but it would mean that certain hardware compute resources that could be used for graphics would never be, and likewise certain hardware graphics resources that could be used for compute would never be. I doubt anyone thinks that is a better design approach. Evidently NVIDIA does not, anyway. Certain critical processing resources are shared between compute and graphics work. Utilization of these shared resources for compute necessarily impacts graphics throughput, and vice versa.

If you have the idea that if the GPU can run purely a graphics rendering task at 120fps, then you should also be able to run a CUDA kernel with no degradation in performance (reduction in fps) of the graphics task, that is simply mistaken. You’ll need to come to grips with that.

Suppose your GPU can do pure graphics rendering using all 1000 ms in a second, at 120FPS rate. Let’s also pretend this is a trivial linear problem so that if we had only 500ms per second, for example, then we could process the same graphics workload at 60FPS.

To a first order approximation, then, if I run a yolo kernel or workload that requires, let’s say, 120ms per second (let’s say we run 8 of your 15ms yolo “kernels” per second), then that leaves 880ms for graphics. I would expect, to a first order approximation, then, that you should be able to do the same graphics task at (880/1000) * 120FPS = 105FPS. We could call that a peak theoretical number, because there is some inefficiency in the context switching. This also assumes no connection whatsoever between the graphics and compute work, no usage of CUDA/graphics interop, etc. If there is some synchronization going on, things could get worse. My guess is that your real application does some synchronization. This first order analysis makes a lot of assumptions, one of the key ones being that the shared hardware resource is the “bottleneck” and that there are no other bottlenecks in your application.

Of course, if we allow the yolo task to run as fast as possible, without any governor, then the graphics processing could slow to a crawl. This is easy to demonstrate.

I decided to run a simple experiment, on a linux laptop (Fedora 25) with a Quadro K610M GPU, and CUDA 8.0.

I created the following code which launches a simple delay kernel once every 100ms. It continues to launch the kernel for a total of 100 seconds. The initial kernel delay is set at 10ms, and after each 10 seconds, the kernel delay is increased by 10ms. So initially, we are launching a kernel once every 0.1s, and the kernel duration is 0.01s. By the end, we are launching a kernel once every 0.1s, and the kernel duration is 0.1s.

Here is the code:

$ cat t15.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

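// Spin until the given number of GPU clock cycles has elapsed.
// Note: despite its name, `us` is a cycle count, not microseconds;
// main() calibrates how many cycles correspond to one second.
__global__ void delay_kernel(unsigned long long us){

  unsigned long long dt = clock64();
  while (clock64() < (dt + us));
}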

int main(){

  // calibrate: a short warm-up launch, then time a 1,000,000-cycle delay
  // to estimate how many clock64() ticks make up one second
  delay_kernel<<<1,1>>>(1000);
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  delay_kernel<<<1,1>>>(1000000);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  unsigned long long one_second = (1000000ULL/(float)dt)*1000000ULL;
  printf("one second = %llu\n", one_second);
  dt = dtime_usec(0);
  delay_kernel<<<1,1>>>(one_second);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("one second = %f\n", dt/(float)USECPSEC);
  // run kernel loop: 10 steps of mpy seconds each; in step incr, launch a
  // kernel lasting incr*10ms (nominal) roughly once every 100ms
  int mpy = 10;
  unsigned long long next = mpy*USECPSEC;
  int incr = 0;
  while (incr < 10){
    incr++;
    dt = dtime_usec(0);
    unsigned long long nt = 0;
    printf("%d\n", incr);
    while (nt < next){
      delay_kernel<<<1,1>>>(one_second*0.01*incr);
      usleep(100000);
      nt = dtime_usec(dt);
    }
  }
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -o t15 t15.cu
$ ./t15
one second = 877963136
one second = 0.920732
1
2
3
4
5
6
7
8
9
10
$

The nvvp profile timeline looks like this (with mpy set to 1 instead of 10, so it runs in 10 seconds instead of 100):

Next, I started glxgears. Then I ran my test app in another terminal window while glxgears was running. Here was the (text) output from glxgears during the ~100s of my test app execution:

6905 frames in 5.0 seconds = 1380.915 FPS
5589 frames in 5.0 seconds = 1117.635 FPS
6793 frames in 5.0 seconds = 1358.460 FPS
6883 frames in 5.0 seconds = 1374.402 FPS
6242 frames in 5.0 seconds = 1247.301 FPS
6236 frames in 5.0 seconds = 1243.349 FPS
5488 frames in 5.0 seconds = 1096.961 FPS
5503 frames in 5.0 seconds = 1097.312 FPS
4892 frames in 5.0 seconds = 977.672 FPS
4883 frames in 5.0 seconds = 973.700 FPS
4074 frames in 5.0 seconds = 813.966 FPS
4082 frames in 5.0 seconds = 813.868 FPS
3392 frames in 5.0 seconds = 677.873 FPS
3417 frames in 5.0 seconds = 681.422 FPS
2645 frames in 5.0 seconds = 528.594 FPS
2610 frames in 5.0 seconds = 520.406 FPS
1947 frames in 5.0 seconds = 389.110 FPS
1974 frames in 5.0 seconds = 393.616 FPS
1213 frames in 5.0 seconds = 242.400 FPS
1212 frames in 5.0 seconds = 241.674 FPS
552 frames in 5.0 seconds = 110.305 FPS
659 frames in 5.0 seconds = 131.753 FPS
7316 frames in 5.0 seconds = 1463.083 FPS

So we see that it is fairly easy to roughly predict the performance in this particular case, based on the idea that the GPU is running either CUDA or graphics, but not both. The highest observed framerate in my case was about 1400 FPS, when there is no CUDA activity. The lowest framerate observed, about 110 FPS, less than 10% of the maximum, was when the CUDA activity is nearly continuous. Note that the CUDA kernel itself is doing next to nothing. It has almost no resource utilization (one block of one thread, which thread is making no use of CUDA resources like floating point units, or memory accesses), and yet when it is running it is effectively preventing graphics from running.

Note that this is just what I observed on my particular test case, on a fairly old GPU. You may witness something different on your Quadro P4000. The process/context scheduler on Pascal and newer architectures may implement a time-sliced inter-context scheduler rather than a round-robin scheduler. In that case, the behavioral characteristics may look different on your Quadro P4000 GPU, especially when you get to the region where the delay kernel would ordinarily be filling the timeline. This particular delay method is susceptible to varying behavior in the presence of the time-sliced scheduler. I don’t know if you would observe that or not in this case.

Also, perhaps a more interesting test case would be to launch the graphics work and the CUDA work from the same application/process.

@Robert
Thanks so much for the detailed experiment. It helps make things much clearer.
However, as a next step, I will need to check how this affects the minimum rendering FPS.
In a real-time system, average FPS is not the only factor in quality.

From my point of view, there is another important point. When we run CUDA and rendering in 2 separate applications, they are async. Therefore, the rendering delay (a render request arriving while the GPU is busy) varies.
I am curious how smartly NVIDIA schedules this work to reduce the maximum waiting time.