I am doing some DNN inferencing using an ONNX model and OpenCV. I’ve noticed big differences in inference speed on Docker compared to WSL2. Both are using Ubuntu 20.04.
I’ve run nvidia-smi while the application was running, and it looks like on Docker my GPU is not reaching its full graphics clock?
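For reference, this is roughly how the inference is set up (a minimal sketch; “model.onnx” and the input size are placeholders, not my actual model):

```
import cv2
import numpy as np

# Minimal OpenCV DNN setup with the CUDA backend (placeholder model/input size).
net = cv2.dnn.readNetFromONNX("model.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

frame = np.random.rand(480, 640, 3).astype(np.float32)   # stand-in for a real image
blob = cv2.dnn.blobFromImage(frame, size=(224, 224))
net.setInput(blob)
out = net.forward()                                       # one inference pass on the GPU
```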
The data doesn’t seem to be consistent to me. The section you’ve captured at the higher clock rate shows a longer period of time where the GPU is at that higher clock rate, whereas the lower clock rate section shows the GPU holding the lower clock rate for a shorter period of time. If these were comparable, I would expect the opposite - especially if this is leading to an actual increase in application runtime. So it’s possible you’re not comparing apples to apples here.
At any rate, to investigate an unexpectedly low GPU core clock rate, I would look at the clocks throttle reasons in nvidia-smi -a output for additional clues.
The data doesn’t seem to be consistent to me. The section you’ve captured at the higher clock rate shows a longer period of time where the GPU is at that higher clock rate, whereas the lower clock rate section shows the GPU holding the lower clock rate for a shorter period of time.
I am not sure about that. If for the idle state we assume a clock rate of 300 MHz, then for the faster solution we have 16 “ticks” of activity, while for the slower one we have 21 “ticks” at a clock rate higher than 300 MHz.
Running nvidia-smi -a shows:
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
I was comparing the time spent at 1875 MHz vs. the time spent at 1005 MHz.
For the nvidia-smi -a output, you would need to check that while the application is running, in particular while the application on Docker is running and while the clock is at 1005 MHz.
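If catching the right moment by hand is awkward, something like the following can log the clock and throttle reasons continuously while the application runs in another process (a rough sketch using the nvidia-ml-py/pynvml bindings):

```
import time
import pynvml   # pip install nvidia-ml-py

# Poll the graphics clock, utilization, and throttle-reason bitmask twice per
# second while the inference application runs in another process.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0; adjust if needed
try:
    while True:
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        print(f"{time.strftime('%H:%M:%S')}  clock={clock} MHz  "
              f"util={util.gpu}%  throttle_mask=0x{reasons:x}")
        time.sleep(0.5)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```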
In my experience, 300 MHz is the GPU idle clock and 1005 MHz is the clock speed after CUDA initialization. I call that the “ready for computation” state. Clock boost only takes place if there is a sustained load on the GPU, and this does not happen instantaneously. 1875 MHz looks close to maximum boost speed.
The above is based on my observations only. Power state and boost state transition logic are internal implementation artifacts and not publicly documented by NVIDIA. They may differ by driver version. In my observation, the GPU clock remains at 1005 MHz when running an app using CUDA if (1) too much time elapses between kernel launches and (2) kernel run time is too short, for some combination of unspecified limits “too much” and “too short”. In other words, attempts to boost the clock do not happen if the GPU isn’t under “full” load, for some definition of “full”.
If the observation was made across two physically different host systems, my hypotheses are that the system reaching the boost clock of 1875 MHz either uses a faster host system (so less time elapses between kernel launches), or a less powerful GPU (resulting in longer kernel run times), or both.
If my hypotheses turn out to be applicable, my suggestion would be to use the fastest host system available to keep the GPU well fed.
@Robert_Crovella I know what you were computing, but the time the clock spends at its peak is not the whole computation time. From the metrics above it looks like the GPU is underutilized in the second case, but I don’t understand why. I did run nvidia-smi -a multiple times and it never showed anything suspicious.
@njuffa the thing is, both host systems are running on the same hardware, on the same computer. The application I wrote was compiled with the same libraries, and I moved it between host systems to test whether the problem is in the application itself. I’ve checked the libraries it was linked against and they are the same. I can’t pinpoint the difference on the software side.
In my observation, the GPU clock remains at 1005 MHz when running an app using CUDA if (1) too much time elapses between kernel launches and (2) kernel run time is too short, for some combination of unspecified limits “too much” and “too short”. In other words, attempts to boost the clock do not happen if the GPU isn’t under “full” load, for some definition of “full”.
So this is a difference observed with same hardware plus same software, except one run is from a Docker installation and the other on bare metal? If so, I have no specific ideas, but logic would appear to dictate that there must be a difference somewhere if there are reproducible performance differences on the order of 40%.
How confident are you that the Docker image is replicating the bare-metal environment exactly? Does Docker have any sort of run-time components that could create additional overhead? I have never used it, but my understanding is that Docker is just a clever packaging mechanism allowing rapid replicated deployments, i.e. no run-time components.
So this is a difference observed with same hardware plus same software, except one run is from a Docker installation and the other on bare metal?
Exactly.
If so, I have no specific ideas, but logic would appear to dictate that there must be a difference somewhere if there are reproducible performance differences on the order of 40%.
Yes, and if it happened to me, it can happen to anyone. I am very curious what the difference is.
How confident are you that the Docker image is replicating the bare-metal environment exactly?
I am 100% sure that Docker is not the factor here. The reasons are:
All Nvidia samples, for example nbody, give me the same output on both systems.
I have two Python virtual environments on Windows (on the very same machine). In one environment I am getting the same measurements as on WSL2, and in the other the same results as on Docker. The problem is that on Windows, with these environments, I don’t remember what I did that makes them differ. So it is easier for me to give the WSL2/Docker example, because I can reproduce it every time.
I have never used it, but my understanding is that Docker is just a clever packaging mechanism allowing rapid replicated deployments, i.e. no run-time components.
It allows you to virtualize the operating system, while a virtual machine virtualizes your hardware.
If that is the case, then the normal GPU clock management would be in place. The only reason for the GPU not to achieve peak clock is if the rate of work issuance is too low. This would suggest that the host code in the docker environment is running more slowly.
I suspect that is where the problem is. Virtualization comes with overhead. Depending on what the app is doing besides GPU work, the overhead in the host code could vary a lot. If the host code runs more slowly because of virtualization, it issues work to the GPU more slowly. That might lead to low GPU load and result in the GPU running at the base clock of 1005 MHz.
To confirm this hypothesis, I would profile the host code in detail, with particular attention to the delay between issuances of work to the GPU.
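As a crude first pass before reaching for a profiler, timing the host side around each forward call can already show whether the gaps between GPU submissions differ between the two environments (a sketch; the model name and input size are placeholders):

```
import time
import cv2
import numpy as np

# Measure per-inference time and the host-side gap between consecutive
# forward() calls; compare the numbers between Docker and WSL2.
net = cv2.dnn.readNetFromONNX("model.onnx")           # placeholder model
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

frame = np.random.rand(480, 640, 3).astype(np.float32)
runs, gaps = [], []
prev_end = None
for _ in range(200):
    blob = cv2.dnn.blobFromImage(frame, size=(224, 224))
    start = time.perf_counter()
    if prev_end is not None:
        gaps.append(start - prev_end)                 # host work between inferences
    net.setInput(blob)
    net.forward()                                     # blocks until the GPU result is ready
    prev_end = time.perf_counter()
    runs.append(prev_end - start)

print(f"median forward: {np.median(runs) * 1e3:.2f} ms, "
      f"median gap between forwards: {np.median(gaps) * 1e3:.2f} ms")
```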
In practical terms, if your GPU supports clock locking, you may want to experiment with locking in higher clocks. The flip side of that is high power consumption.
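On GPUs and drivers that support it, the clocks can be pinned either with nvidia-smi --lock-gpu-clocks or programmatically through NVML. A sketch via pynvml (requires root/administrator privileges, and not every consumer board exposes this; the clock values are only examples):

```
import pynvml

# Lock the graphics clock into a fixed range (MHz) on supported GPUs/drivers.
# Requires root/administrator privileges; values below are only examples.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1500, 1875)   # min, max in MHz
    input("Clocks locked; run the workload, then press Enter to restore...")
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)             # back to default management
    pynvml.nvmlShutdown()
```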
I profiled the host code; the problem lies within the execution graph of the ONNX model, particularly one node called “Slice”. I’ve tried to profile this with Nvidia Nsight Compute, Nsight Systems and older profilers, but all of them are so buggy and crash every time, so it is impossible to get to the bottom of this. These profilers crash not only on Docker, but on bare Linux and Windows as well.
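For what it’s worth, OpenCV’s built-in per-layer timing can at least confirm which node dominates without an external profiler. A rough sketch, assuming the net is loaded with the CUDA backend as above (and assuming the timing array lines up with getLayerNames()):

```
import cv2
import numpy as np

# Per-layer timing via OpenCV's getPerfProfile(); shows which ONNX node
# (e.g. the "Slice" layer) is eating the time. Model name is a placeholder.
net = cv2.dnn.readNetFromONNX("model.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

blob = cv2.dnn.blobFromImage(np.random.rand(480, 640, 3).astype(np.float32),
                             size=(224, 224))
net.setInput(blob)
net.forward()                                   # run once so timings are populated

total, layer_times = net.getPerfProfile()       # times in clock ticks
freq = cv2.getTickFrequency()
names = net.getLayerNames()
print(f"total: {total / freq * 1e3:.3f} ms")
if len(names) == len(layer_times):              # ordering assumed to match getLayerNames()
    slowest = sorted(zip(names, layer_times.ravel()), key=lambda x: -x[1])[:10]
    for name, t in slowest:
        print(f"{name:40s} {t / freq * 1e3:8.3f} ms")
```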
In practical terms, if your GPU supports clock locking, you may want to experiment with locking in higher clocks. The flip side of that is high power consumption.
Unfortunately that is not supported on regular Nvidia cards, at least not on my RTX 2060.