PTX compilation overhead


I hope this is the right forum to post this bug in.
I work with OptiX ray tracing. In our application, the ptxjitcompiler library takes a significant chunk of time, which slows down application startup. The behavior varies with the platform and the graphics card in ways that I am unable to explain or fix.

Linux:
We have two systems, one with an RTX 2080 Ti and one with a Titan Xp. On the RTX 2080 Ti we don’t see any calls to this library, but I do see them on the Titan Xp machine. Even when I change the NVCC flags to include the architecture of the Titan Xp, I do not see any improvement. How is it that this library is not called on the RTX 2080 Ti? What needs to be done on the Titan Xp machine to get the same startup performance as on the RTX 2080 Ti? Both machines are on the same NVIDIA driver version.

Windows:
To be honest, it is very hard even to find out which libraries are being called on Windows. I tried Nsight Systems, but it does not show this information, which is why I had to resort to Linux. If you can suggest a tool that lays out the calls into the different libraries, that would be very helpful (on Linux I used a tool called FlameGraph). Back to PTX performance: on Windows I get very poor performance even on the RTX 2080 Ti. It does get better from the second run on, so I guess there is some kind of caching happening, but I still don’t see the same performance as on Linux with the RTX 2080 Ti.
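
The faster second run is consistent with an on-disk cache of JIT-compiled code. The CUDA documentation lists the environment variables CUDA_CACHE_DISABLE, CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE for controlling the driver's JIT cache. The following is a minimal sketch of setting them before launching an application. It assumes, without verification, that the OptiX compilation path in use goes through that cache. The helper name, the cache directory and the executable name are hypothetical.

```python
import os


def jit_cache_env(cache_dir, max_bytes, base_env=None):
    """Return a copy of the environment with CUDA JIT cache settings applied.

    cache_dir and max_bytes are illustrative values chosen by the caller.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_CACHE_DISABLE"] = "0"                   # keep the JIT cache enabled
    env["CUDA_CACHE_PATH"] = str(cache_dir)           # where cached binaries are stored
    env["CUDA_CACHE_MAXSIZE"] = str(int(max_bytes))   # cache size limit in bytes
    return env


if __name__ == "__main__":
    # Hypothetical cache location and a 1 GiB limit.
    env = jit_cache_env("/tmp/cuda_jit_cache", 1 << 30)
    for name in ("CUDA_CACHE_DISABLE", "CUDA_CACHE_PATH", "CUDA_CACHE_MAXSIZE"):
        print(name, "=", env[name])
    # To launch a (hypothetical) application with these settings:
    # subprocess.run(["./my_optix_app"], env=env)
```

If the first-launch time does not change with a larger cache, the cache size is probably not the limiting factor.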

Some numbers that we are getting (we start the application, render a fixed number of frames, 25 in this case, and stop it):
Linux, RTX 2080 Ti: 0.05 s
Linux, Titan Xp: 5-6 s
Windows, RTX 2080 Ti: 5-6 s on the first launch, dropping to 3 s from the second launch of the same application onward.
These numbers stay the same even if I compile the PTX files for the architecture on which the application is running.
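
For anyone who wants to reproduce a first-launch versus later-launch comparison like the one above, here is a minimal timing sketch. It measures the wall-clock time of whole process launches, which is an assumption about how such numbers might be collected, not a description of how the figures above were obtained. The command used here is a stand-in that starts the Python interpreter and exits, and should be replaced with the real application.

```python
import subprocess
import sys
import time


def time_launches(command, runs=3):
    """Run `command` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True)  # raises if the command fails
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in command; replace with the application that renders the test frames.
    command = [sys.executable, "-c", "pass"]
    for i, seconds in enumerate(time_launches(command), start=1):
        print(f"launch {i}: {seconds:.3f} s")
```

Comparing the first entry with the later ones separates one-time startup costs from the steady-state launch time.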

I would highly appreciate it if someone could help me understand what exactly is going on across the different platforms and GPU architectures.
How can I get the same performance as on the Linux RTX 2080 Ti system, where the library is never called?

I am not familiar with OptiX. You may want to ask about your issue in the sub-forum for OptiX: OptiX - NVIDIA Developer Forums

Does OptiX rely on dynamic PTX code generation, i.e., is some amount of JIT compilation unavoidable? If not, online compilation can be eliminated by building the code entirely offline as a fat binary that contains machine code for each GPU architecture one intends to target (see the CUDA documentation, specifically the -gencode switch of nvcc). For the Titan Xp this is sm_61, and for the RTX 2080 Ti it is sm_75.
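
As a rough sketch of what such a fat-binary build could look like for plain CUDA code, the snippet below assembles an nvcc command line with one -gencode entry per target architecture. It only builds the command and does not run nvcc. The helper names and the file names kernels.cu and kernels.o are hypothetical, and whether this approach applies to OptiX programs at all is exactly the open question above.

```python
def gencode_flags(sm_versions):
    """Return nvcc -gencode flags that embed machine code (SASS) for each target."""
    flags = []
    for sm in sm_versions:
        # arch= names the virtual architecture, code= the real GPU architecture.
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    return flags


def nvcc_command(source, output, sm_versions):
    """Assemble an nvcc compile command for the given targets (not executed here)."""
    return ["nvcc", *gencode_flags(sm_versions), "-c", source, "-o", output]


if __name__ == "__main__":
    # 61 = Titan Xp, 75 = RTX 2080 Ti; the file names are placeholders.
    print(" ".join(nvcc_command("kernels.cu", "kernels.o", [61, 75])))
```

With machine code present for both architectures, the driver has no PTX to JIT-compile for those GPUs at load time.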

JIT compilation speed is a function of the following hardware characteristics: (1) single thread CPU performance (2) system memory performance (3) mass storage performance. So for optimal compilation performance you would want a host system with a high CPU base frequency (> 3.5 GHz), as many channels of fast DDR4-2666 or faster speed grade as you can afford (suggest four or more), and an SSD.

Compilation speed will also vary with optimization level (lower optimization level → faster compilation), but lowering optimization level would generally be counter-productive to application performance level.

In OptiX, the PTX file/string is bound to program objects, so to answer your question: I definitely need PTX code generation.
The CPU hardware characteristics are the same for both the Titan Xp and the RTX 2080 Ti machines. The RTX 2080 Ti on Linux does not call the JIT compiler, which is peculiar. It would be nice to have similar behavior on all the other platforms and GPU architectures.

I am sure there is a rational explanation; I just don’t know what it is, since I have no idea about OptiX. I know it exists and does something with ray tracing. Did you inquire in the OptiX forums? It may be something as simple as a configuration setting that needs to be adjusted.

I have now. If I do get a solution, I will link it here.