The first time I call nppiResize_8u_C4R() in my C++ application, it takes about 20 seconds to return.
This issue only happens on Ubuntu 20.04 when running without sudo.
It does not happen on Ubuntu 18.04 (works fine with/without sudo).
It does not happen on Ubuntu 20.04 when running with sudo.
I have a C++ application which (among other things) receives H265 encoded video over a TCP socket, decodes the video, and uses OpenGL to display the decoded frames.
On Ubuntu 18.04 - the application works perfectly, and has been in use for several years already. (works both with/without sudo)
On Ubuntu 20.04 when running with sudo - it also works properly.
On Ubuntu 20.04 when running without sudo - the app takes ~50 seconds to start working properly.
When running without sudo on Ubuntu 20.04:
For the first ~50 seconds of the run, I see a bunch of weird symptoms:
The app hogs ~100% CPU
The app’s TCP RECV-Q fills up (according to netstat -noap)
The issue relevant here: the app calls nppiResize_8u_C4R, and the function takes ~20 seconds to return.
After about 50 seconds, the app stabilizes, TCP Q is cleared, CPU goes down, and the video is streamed and displayed properly.
For every video frame that our application displays, we first use nppiResize_8u_C4R.
Only when running on Ubuntu 20.04 without sudo, the first call to this function takes about 20 seconds to return. The next calls are fast.
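To put numbers on the "first call slow, next calls fast" observation, a small wall-clock timing helper can wrap the call. This is only a sketch: timeCalls is a hypothetical helper name, and the lambda passed to it stands in for the real nppiResize_8u_C4R invocation with its actual arguments. It measures how long the call takes to return on the host, which is the quantity described above.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `call` n times and returns the wall-clock
// duration of each run in milliseconds, in call order.
std::vector<double> timeCalls(const std::function<void()>& call, int n) {
    std::vector<double> durationsMs;
    if (n > 0) durationsMs.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        const auto start = std::chrono::steady_clock::now();
        call();  // e.g. a lambda wrapping the resize call
        const auto stop = std::chrono::steady_clock::now();
        durationsMs.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return durationsMs;
}
```

If the first entry is orders of magnitude larger than the rest, the cost is a one-time initialization of some kind rather than the resize itself.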
OS: Ubuntu 20.04 Focal Fossa
uname -a output:
Linux g2g 5.15.0-71-generic #78~20.04.1-Ubuntu SMP Wed Apr 19 11:26:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Two of the usual suspects are GPUs not in persistence mode and JIT-compilation delays.
Neither seems very likely to me, but both can be researched and are discussed in various places. The JIT delay would apply if you were statically linking your app to some large CUDA library (like NPP) and you linked the app against a version of CUDA prior to 11.0 (when Ampere support appeared). In that case, the app would still run on your machine, but the initial CUDA call (e.g. into the library) would incur a long delay while a bunch of kernels get compiled from PTX to SASS for your A4500. JIT compilation would also be consistent with “hogs ~100% CPU”.
That doesn’t seem likely to me because if you built your app in the environment you indicate (CUDA 11.4) then it would not be applicable. I also don’t know that sudo should matter there. (There is a JIT cache also. It might be that JIT caching is different on your setup with sudo and without.)
I don’t have any other ideas or suggestions at the moment.
The GPU was indeed in Persistence Mode: Disabled.
I now set sudo nvidia-smi -pm ENABLED, and confirmed with nvidia-smi -q that the GPU switched to Persistence Mode: Enabled.
However, it didn’t seem to make any difference, still took ~52 seconds until the app stabilized and started running properly.
Sounds like JIT compilation - the older GPU is an older arch that matches whatever you built.
It’s just a guess, I don’t have anything further to suggest. If you indicate how you built the code (all details - the machine environment, exact compile commands, etc.) it may help to rule this idea in or out.
Yes, you have a bunch of packages related to CUDA 10. Code built against CUDA 10 cudart or CUDA 10 libraries is going to JIT-compile if you try to run it on an Ampere GPU (cc8.x). The same code won’t need to JIT-compile if you run it on a 1650 device (cc7.5).
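The rule behind this can be sketched in a few lines of C++. It is a simplified model, not the driver's actual selection logic: precompiled SASS for cc X.y runs on a device of cc X.z with z >= y (same major version), while PTX for cc X.y can be JIT-compiled for any equal or newer compute capability. The names ComputeCapability, LoadPath and classifyLoad are hypothetical, and the architecture lists a real binary contains depend on how it was built.

```cpp
#include <vector>

// Hypothetical types for illustration only.
struct ComputeCapability { int ccMajor; int ccMinor; };

enum class LoadPath { NativeSass, JitFromPtx, NoCompatibleImage };

// Simplified model: decide whether a device can use embedded SASS directly,
// must JIT-compile embedded PTX, or has nothing it can load.
LoadPath classifyLoad(ComputeCapability device,
                      const std::vector<ComputeCapability>& sassTargets,
                      const std::vector<ComputeCapability>& ptxTargets) {
    // SASS is usable only within the same major version, at equal or lower minor.
    for (const auto& s : sassTargets) {
        if (s.ccMajor == device.ccMajor && s.ccMinor <= device.ccMinor)
            return LoadPath::NativeSass;
    }
    // PTX is forward compatible: any equal or older target can be JIT-compiled.
    for (const auto& p : ptxTargets) {
        if (p.ccMajor < device.ccMajor ||
            (p.ccMajor == device.ccMajor && p.ccMinor <= device.ccMinor))
            return LoadPath::JitFromPtx;
    }
    return LoadPath::NoCompatibleImage;
}
```

With SASS targets up to 7.5 plus 7.5 PTX (an assumed stand-in for a CUDA 10 era build), a cc7.5 device lands on NativeSass and a cc8.6 device lands on JitFromPtx. With the PTX list empty, which is roughly what disabling PTX JIT amounts to, the cc8.6 device gets NoCompatibleImage.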
You can upgrade the drivers (and may need to, at some point) but that alone won’t fix anything regarding the long delay. You’ll need to move the project forward to a CUDA 11.1 or newer base. I haven’t studied your project to learn all the dependencies, but the 8.0.14 codec sdk is pretty old (current versions are 11.x or newer). For detailed questions about video codec sdk/projects, I would ask on the video codec forum.
I don’t know of any method to confirm JIT compilation directly. Inferentially, you could use the environment variable to disable PTX JIT. If your application uses proper error checking and fails at that point (no binary for GPU, or something like that), then you could confirm that JIT compilation is indeed necessary, but that is not a perfect view into what is going on, or whether the JIT compilation is indeed affecting the entire 50 second duration. As I stated previously, the fact that everything is fine with sudo is a conflicting data point, and suggests that even if JIT compilation is occurring, it is not the primary contributor to the observation. Or, alternatively, it may be that JIT compilation is the issue, but sudo is having some effect on the JIT caching process.
I don’t have further ideas/suggestions. The sudo data point may be instructive, but I can’t imagine all the possibilities.
It doesn’t look to me like you are doing static linking against npp, therefore, ldd on the built binary may be instructive, to see the actual npp libraries it will load at runtime.
Updating that the issue indeed seems to be related to JIT-compilation.
Specifically the JIT cache (~/.nv/) belonged to root, that’s why the issue was only seen when running without sudo.
Details @Robert_Crovella - the environment variables helped me figure it out, thanks!
(also, sorry I didn’t provide this info earlier as I was not aware of it, but our application loads PTX files, and that’s probably the root cause of the issue)
When running my app with CUDA_DISABLE_JIT=1 the app indeed failed when trying to load a PTX file with cuModuleLoadDataEx.
I realized that the result of the JIT compilation is cached in ~/.nv
This caching dir belonged to root, and thus was not accessible when running without sudo.
That’s why the app started immediately with sudo (because the ptx was loaded from the cache), but took time to start when running without sudo (because the cache was not accessible, so JIT-compilation had to be performed again in every run).
Removing ~/.nv/ so it can be re-created by unprivileged user solved the issue.
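For anyone who wants to catch this at startup, the check can be sketched with plain POSIX calls. Assumptions to note: the default Linux cache location is taken to be ~/.nv/ComputeCache, overridable through the CUDA_CACHE_PATH environment variable, and the names resolveCachePath, checkCacheDir and CacheDirStatus are hypothetical helpers, not part of any CUDA API.

```cpp
#include <cstdlib>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

enum class CacheDirStatus { Missing, OwnedByOtherUser, NotWritable, Usable };

// Picks the cache directory: the CUDA_CACHE_PATH value if set and non-empty,
// otherwise the assumed default under the user's home directory.
std::string resolveCachePath(const char* cudaCachePath, const char* home) {
    if (cudaCachePath != nullptr && *cudaCachePath != '\0') return cudaCachePath;
    return std::string(home != nullptr ? home : "") + "/.nv/ComputeCache";
}

// Reports whether the calling user can be expected to use the directory.
CacheDirStatus checkCacheDir(const std::string& path) {
    struct stat st {};
    if (stat(path.c_str(), &st) != 0 || !S_ISDIR(st.st_mode))
        return CacheDirStatus::Missing;  // the driver may be able to create it
    if (st.st_uid != getuid())
        return CacheDirStatus::OwnedByOtherUser;  // e.g. created under sudo
    if (access(path.c_str(), R_OK | W_OK | X_OK) != 0)
        return CacheDirStatus::NotWritable;
    return CacheDirStatus::Usable;
}

// Convenience wrapper reading the real environment.
CacheDirStatus checkDefaultCacheDir() {
    return checkCacheDir(
        resolveCachePath(std::getenv("CUDA_CACHE_PATH"), std::getenv("HOME")));
}
```

Logging a warning when checkDefaultCacheDir() returns OwnedByOtherUser or NotWritable would have pointed at the root-owned cache right away. The ownership test is only a hint, since a directory owned by someone else can still be writable through group permissions.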
The proper solution is probably to change how we compile our PTX files…
We use the following command to compile our ptx files:
I guess that we need to adapt our compilation flags to make our PTX files compatible with newer architectures.
Possibly we need to compile to several architectures in “fat binaries”, since we need to support both new & old GPUs. I will look into this.