Hi,
I have a path tracer program which uses OptiX 6.5 (similar to the path tracer example of the SDK). Due to the complexity of the scene, a lot of frames (~100,000) are necessary to get a decent image (even with denoising). During ray tracing the frame rate drops with increasing frame number to one fourth of the original frame rate or even lower.
With GPU-Z I can see that the memory controller load increases with the frame number, but there is no change in the payload or in the data that is saved or transferred to the CPU (at least none I'm aware of). The ray tracer does not use recursion. The slowdown also happens when I run the program in benchmark mode, without displaying the current result.
When I pause the ray tracer to let the GPU cool down, it continues at the same low frame rate after I resume the program, so it does not seem to be a temperature issue.
Any idea why the GPU is becoming slower?
Any hints on what one could do to prevent the continuous drop in frame rate?
Unfortunately it’s not possible to say what is going on there. This shouldn’t really happen.
Could you please provide the following system configuration information:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the input PTX, host compiler version.
What exactly do you mean by “the memory controller load is increasing with increasing frame number”?
Is more CPU or GPU memory used the more frames you render? That sounds like a memory leak inside the application then.
Do you use the OptiX C++ wrappers in your application?
Do you change any scene data and forget to destroy the previous data? The C++ wrappers in the old OptiX API don’t do that automatically.
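For example (a hypothetical sketch with the OptiX 6.x C++ wrappers; “vertex_buffer” and the counts are assumed names/values): assigning a new buffer to a context variable only rebinds the handle, the previous allocation stays alive until you destroy it explicitly.

```cpp
// Hypothetical sketch: replacing a buffer does NOT free the old one automatically.
optix::Buffer oldBuffer = context["vertex_buffer"]->getBuffer();
optix::Buffer newBuffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, vertexCount);
// ... fill newBuffer ...
context["vertex_buffer"]->setBuffer(newBuffer); // only rebinds the variable
oldBuffer->destroy();                           // without this, the old allocation leaks on the device
```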
I assume you’re not using the denoiser per frame but only at the very end at that sub-frame count.
If that assumption is wrong, does the slowdown also happen when not using the denoiser?
Could you please try looking at the GPU clock rates at frequent intervals while the performance gets slower, and also when running a second, slow iteration?
Maybe it got stuck in a low power state.
Assuming a Windows OS, the NVIDIA SMI tool is normally installed to C:\Program Files\NVIDIA Corporation\NVSMI and allows you to query that from the command line: nvidia-smi.exe --query --display=CLOCK
Check the nvidia-smi manual (PDF) in that folder for many more options.
Other than that, there are multiple path tracer examples I’ve written against OptiX 5.1 (which also build under 6.5) and OptiX 7 versions.
Would you be able to verify whether these behave correctly, to rule out a system dependency?
(When using the old OptiX Introduction examples with MSVS versions 2017 and newer, please set the CUDA_HOST_COMPILER CMake variable manually. The old FindCUDA.cmake is out of date in that repository.)
The OS is Windows 10 (Version 10.0.17763.973).
I have seen the “drop in frame rate” effect with a GeForce RTX 2080 SUPER, a GeForce RTX 2080 Ti, and a Quadro RTX 4000.
The GeForce display driver is 446.14 (446.14-desktop-win10-64bit-international-dch-whql.exe).
It’s OptiX 6.5.0 and CUDA 10.1.2,
and Visual Studio Professional 2017, Version 15.9.25.
The GPU memory the program uses (~3 GB, depending on the scene) stays constant. It is the GPU memory controller load that is increasing; according to the internet it “measures how much of your total memory bandwidth is being used”.
I use the OptiX C++ wrappers. The scene data is not changed. The only thing that changes is the frame number, which is used for the random number generation on the GPU.
I’m not using the denoiser during the frame rate test.
I’ll do the other tests you proposed and record the GPU clock rate.
What is remarkable so far is that during the first 16000 frames the frame rate stays more or less constant. After 32000 frames it is only ~83% of the original frame rate, and it keeps decreasing (after 65000 frames it is ~73%, after 131000 frames ~58%, …).
Thanks. I have not seen such a result, but I also do not render 100,000 iterations.
That you need that many is due either to a really complex lighting setup, a sub-optimal light transport algorithm for the job, or a bad random number generator (or any combination of these).
If this is a display driver issue, the only option is to try out newer display drivers first.
There are plenty of newer ones than 446.14.
That one is not even on the list of RTX 2080 Ti Windows 10 64-bit drivers anymore: https://www.nvidia.com/Download/Find.aspx?lang=en-us
I’m speculating wildly, but I can imagine a couple of reasons this could be happening.
If your frames are getting more random over time for any reason, then you might be suffering from lower cache coherence. You can use Nsight Compute to verify whether this is the case: capture a launch early on, then capture a launch much later, and use the baselines feature to compare them, perhaps looking at the memory workload analysis.
Another possibility is that you’re getting fragmentation in the OptiX 6.5 memory manager. This could be harder to detect, but you could have your program tear down and recreate the OptiX context every 1000 frames or so. If the behavior goes away, then you can point the finger at OptiX, and in that case the easiest and best fix would be to move to OptiX 7.
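Something along these lines could serve as a rough test (a sketch against the OptiX 6.5 C++ wrappers; rebuildScene() and the frame_number variable are placeholders for whatever your application actually does, and the accumulation buffer is lost on every rebuild, which is fine for a pure performance test):

```cpp
// Rough fragmentation test (not production code): recreate the OptiX context
// every 1000 sub-frames and check whether the slowdown still accumulates.
optix::Context context = optix::Context::create();
rebuildScene(context); // placeholder: recreate programs, geometry and buffers

for (unsigned int frame = 0; frame < totalFrames; ++frame)
{
    context["frame_number"]->setUint(frame);
    context->launch(0, width, height);

    if ((frame + 1) % 1000 == 0)
    {
        context->destroy();               // frees everything owned by the context
        context = optix::Context::create();
        rebuildScene(context);            // upload the (unchanged) scene again
    }
}
```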
“The only thing that changes is the frame number, which is used for the random number generation on the GPU.”
I agree with Detlef: this could be a reason for degrading performance. Try adding a large number to your frame seed right from the start and see if the behavior changes.
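For example (a minimal device-side sketch, assuming your seeds are generated the way the SDK samples do it with the tea<> hash from random.h; the names and the offset value are just for illustration):

```cpp
#include <optix_world.h>
#include "random.h" // tea<>() / rnd() helpers shipped with the OptiX SDK samples

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
rtDeclareVariable(uint2, launch_dim,   rtLaunchDim, );
rtDeclareVariable(unsigned int, frame_number, , );

RT_PROGRAM void ray_generation()
{
    // Offset the per-frame seed by a large constant so that frame 0 already uses
    // the random sequence of a "late" frame. If the frame rate is low right from
    // the start, reduced coherency due to the RNG becomes the likely culprit.
    const unsigned int frame_offset = 100000u; // arbitrary large test value
    unsigned int seed = tea<16>(launch_index.y * launch_dim.x + launch_index.x,
                                frame_number + frame_offset);

    // ... generate the samples with rnd(seed) as before ...
}
```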
It may be worth taking a step back and asking why it’s taking 100k iterations to get decent results even with denoising. You haven’t mentioned what kind of problem you’re solving or why the scene is complex, but is a number this high really expected? This is the kind of thing I get if I use pure path tracing without next event estimation (meaning no shadow rays are sent; you wait to hit an emitter randomly). You might be able to find orders of magnitude of speedup by investing in sampling algorithms or bidirectional light transport algorithms, whereas solving the slowdown here can only net a maximum 4x improvement overall.