PTX compilation overhead

Hi,

In our application, the PTX JIT compiler (ptxjitcompiler) takes a significant chunk of time, which slows down startup. I see behavior that varies with the platform and the graphics card in use, and I am unable to explain or fix it.

Linux :
We have two systems, one with an RTX 2080 Ti and one with a Titan Xp. On the RTX 2080 Ti we don’t see any calls into libnvidiaptxjitcompiler.so, but I do see them on the Titan Xp machine. Even when I change the NVCC flags to include the architecture of the Titan Xp, I do not see any improvement. Why is this library never called on the RTX 2080 Ti? What needs to be done on the Titan Xp machine to get the same startup performance as on the RTX 2080 Ti? Both machines are on the same NVIDIA driver version.

Windows :
To be honest, it is very hard to even find out which libraries are being called on Windows. I tried Nsight Systems, but it does not show this information, which is why I had to resort to Linux. If you can suggest a tool that lays out the calls into the different libraries, that would be very helpful (on Linux I used a tool called FlameGraph). Back to PTX performance: on Windows I get very poor performance even on the RTX 2080 Ti. It does get better from the second run onward, so I guess some kind of caching is happening, but I don’t get the same performance as on Linux with the RTX 2080 Ti.

Some numbers that we are getting (we start the application, render a fixed number of frames, 25 in this case, and stop it):
Linux, RTX 2080 Ti: 0.7s
Linux, Titan Xp: 5-6s
Windows, RTX 2080 Ti: 5-6s on the first launch, coming down to 3s from the second launch of the same application onward.
These numbers remain the same even if I compile the PTX files for the same architecture the application is running on.

I would highly appreciate it if someone could help me understand what exactly is going on across the different platforms and GPU architectures.
How can I get the same performance as on the Linux RTX 2080 Ti system, where the library is never called?

On Windows you can use procexp or listdlls from the Microsoft Sysinternals utilities to see which DLLs are loaded. There are also some good answers on Stack Overflow, for example under the question “How do I find out which dlls an executable will load?”

I usually use procexp, but it only shows currently loaded DLLs for a running process. It’s not easy to do if the process runs and exits quickly.

Which version of OptiX and which driver are you using?

I guess you might be seeing normal OptiX cache behavior. Shader programs are cached on disk after compilation. Before compiling, OptiX checks the cache, and if the requested shader is already there, it is loaded from the cache instead of being compiled. If you want to understand compilation behavior, it’s a good idea to delete your cache every time you want to replicate first-time compilation behavior. On Linux, the cache is in /var/tmp/OptixCache_<your_username>/. On Windows, the cache is in a folder with a hash in the name; you can use procexp or Sysinternals’ “handle” to find the cache file, which is called “cache7.db” (OptiX 7) or “cache.db” (OptiX 6).
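To make the cold-versus-warm comparison repeatable, the delete-then-run procedure described above can be scripted. The following is only a minimal sketch under stated assumptions: the helper names `time_run` and `cold_and_warm` are made up for illustration, the application command is a placeholder, and it assumes that removing the whole cache directory is an acceptable way to reset the cache.

```python
import shutil
import subprocess
import time
from pathlib import Path


def time_run(cmd):
    """Run cmd to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def cold_and_warm(cmd, cache_dir):
    """Delete cache_dir, then time a cold run followed by a warm run.

    Both names are hypothetical helpers, not part of OptiX.
    """
    # ignore_errors=True so that a missing cache directory is not an error.
    shutil.rmtree(Path(cache_dir), ignore_errors=True)
    cold = time_run(cmd)  # cache is empty, so this run includes compilation
    warm = time_run(cmd)  # cache was populated by the cold run
    return cold, warm


# Example use on Linux (placeholder application path, cache location as above):
#   import getpass
#   cache = f"/var/tmp/OptixCache_{getpass.getuser()}"
#   cold, warm = cold_and_warm(["./my_app"], cache)
#   print(f"cold {cold:.2f}s, warm {warm:.2f}s")
```

The difference between the two timings is a rough estimate of the time spent compiling. Whatever remains in the warm run is startup cost that the cache does not remove.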

If compilation time is a problem and you want to reduce it, try reviewing how much inlining you have; that’s often a culprit for slow compilation times. Callable programs are one OptiX feature that can help with compilation times, by strategically preventing function inlining.


David.

Hi David,

Thanks for the procexp suggestion; it was helpful on Windows.
For this particular test we are using OptiX 6.5, and the driver version is 460.32 on Linux and 461.09 on Windows.
We were able to locate the cache on both Linux and Windows.
After deleting the OptixCache, I did see a significant difference in the time of the first run on both Windows and Linux. However, the time it takes differs significantly between the two platforms.
On Linux the first run takes 7s, and it takes 1s from the second run onward (I get exactly the same performance on the Titan Xp as well after deleting the OptixCache).
On Windows the first run of the same program takes 12s, and it takes 4s from the second run onward.
Both are running on the same hardware.
Why do I see this difference?

Compilation is a host-side operation, so the time taken is a function of many things that include your operating system, CPU, available host memory, other running processes, NVIDIA driver version, and any OS-specific differences in both the compiler and the compilation. It’s pretty common in my experience to see differences in compilation time across operating systems.


David.

Thanks for the input, David. We will do some more internal analysis here. A few months back, Windows was giving us results similar to what Linux gives at the moment (execution times of 1s). Performance then regressed on both operating systems. I am glad that we have resolved the delay on Linux, i.e. after resetting the cache it is able to use the cache properly. The same trick does not help on Windows.
Much appreciated!