Compute rays/sec for OptiX Program

Hello!

I’m new to profiling OptiX programs and wanted to know: What is the best way to compute the number of rays processed per second? Would this require adding some code, or using a profiler? I’d appreciate any insight!

Hi @n16,

This is a good question; it’s not always straightforward to gather rays-per-second data without affecting the perf you want to measure.

Nsight tools don’t report rays per second directly. (Maybe they could! I’ll look into putting in a feature request.) You still might be able to use Nsight Systems for timing, since it isn’t very invasive and it does report kernel timings. Nsight Compute might also help, but there’s no direct metric, and it’s more invasive, so it can affect perf. I do recommend getting familiar with the Nsight profiling tools and trying to use them, but for computing rays per second it’s also useful to know how to gather the data yourself.

In the past my strategy has been to run two different OptiX kernels back to back, each doing the same render job: one specialized for high performance, the other specialized for counting the total number of rays. I typically use the OptiX feature called ‘bound values’ for this, since it is very convenient. Bound values are launch params that look like variables but are compiled out and optimized away at module creation time. So my time-specialized kernel does not count rays, compiles out all debug features, and is generally optimized to go as fast as possible, while my count-specialized kernel uses an atomic to count the total number of calls to either optixTrace() or optixTraverse().

You only need an atomic ray counter if you’re doing something complicated like stochastic path tracing. If you’re doing something simple, like casting only primary rays at a constant count per pixel, then your ray count can be determined in advance at launch time and you may not need to measure it at all.

At the end, I just divide the ray count from the count kernel by the time spent in the time kernel, and that is my computed rays per second. Make sense?
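To make that concrete, here is a minimal sketch of the counting specialization, under the assumption that you’re on a recent OptiX 7.x/8.x release with bound-value support. The names (Params, countRays, rayCounter, traceCounted) are illustrative, not from any SDK sample:

```cpp
#include <optix.h>

// Launch params shared between host and device. Layout here is illustrative.
struct Params
{
    OptixTraversableHandle handle;
    unsigned long long*    rayCounter;  // device counter, zeroed before the counting launch
    int                    countRays;   // bound value: 0 in the timing module, 1 in the counting module
    // ... camera, output buffer, etc.
};

extern "C" __constant__ Params params;

// Wrapper around optixTrace() that bumps the global ray counter. When countRays
// is bound to 0 at module creation, the branch and the atomic are compiled out.
static __forceinline__ __device__ void traceCounted(
    float3 origin, float3 direction, float tmin, float tmax,
    unsigned int& p0, unsigned int& p1 )
{
    if( params.countRays )
        atomicAdd( params.rayCounter, 1ull );

    optixTrace( params.handle, origin, direction, tmin, tmax,
                0.0f,                       // rayTime
                OptixVisibilityMask( 255 ),
                OPTIX_RAY_FLAG_NONE,
                0, 1, 0,                    // SBT offset, SBT stride, miss SBT index
                p0, p1 );
}
```

On the host, the same field is bound to a compile-time constant once per specialization, so the timing module never even sees the counter:

```cpp
#include <cstddef>  // offsetof

int countRays = 0;  // bind 1 instead when building the counting module

OptixModuleCompileBoundValueEntry boundValue = {};
boundValue.pipelineParamOffsetInBytes = offsetof( Params, countRays );
boundValue.sizeInBytes                = sizeof( int );
boundValue.boundValuePtr              = &countRays;
boundValue.annotation                 = "countRays";

OptixModuleCompileOptions moduleCompileOptions = {};
moduleCompileOptions.boundValues    = &boundValue;
moduleCompileOptions.numBoundValues = 1;
// ... fill in the rest of the compile options and create the module as usual
// (optixModuleCreate() in OptiX 8, optixModuleCreateFromPTX() in earlier 7.x).
```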

Because the ray counts are stochastic in the above example, and because I measure time and ray counts in two different runs, my computed rays-per-second metric is not 100% accurate. But with enough rays it is statistically accurate, usually much more accurate than the measurement noise. Typically I’m testing billions of rays, and the ray counts are consistent from run to run to within a very small fraction of a percent (hundredths or thousandths), so the computed rays per second is quite reliable.

A few additional notes:

  • It’s best to measure performance on a different GPU than your display GPU, if you have that option. If my machine doesn’t have a built-in integrated GPU, then I will install a small/cheap GPU for driving the monitor. If I use a single GPU for the display and for timing OptiX, then I generally see much higher variance in the timings from run to run, on the order of maybe 10%.
  • When you time your OptiX kernels, make sure to use CUDA stream events, or otherwise be very careful with synchronization. It’s super easy to forget that OptiX API calls are asynchronous on the host, so timing the call only captures the host activity and not the GPU activity. It’s similarly easy to over-synchronize and slow down the code. Using CUDA events with callbacks might at first seem like extra work, but it’s really the easiest way to get accurate results; there’s a minimal event-timing sketch after this list.
  • If you want to compare measurements over time, for example if you are optimizing and want to check how well your optimizations are working, then you’ll want to lock your GPU clocks before each benchmark run. To avoid thermal throttling, it’s best to lock your clocks to something below their peak rate; I currently use a graphics clock rate that is 80% of the peak. Thermal throttling can change your clock rate and affect your perf at any time, but especially if your renders take a while or if you do multiple benchmark runs in a row. You can lock the clocks with nvidia-smi; don’t forget to unlock them when done (i.e., write a profiling script that locks, measures, and then unlocks).
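Regarding the timing bullet above, here is a minimal sketch of CUDA event timing wrapped around optixLaunch(). It synchronizes on the stop event rather than using a callback, error checking is omitted for brevity, and the helper name timedLaunch is just an illustration:

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>

// Hypothetical helper: times a single optixLaunch() with CUDA events and
// returns the GPU time in milliseconds.
float timedLaunch( OptixPipeline pipeline, CUstream stream,
                   CUdeviceptr d_params, size_t paramsSize,
                   const OptixShaderBindingTable& sbt,
                   unsigned int width, unsigned int height )
{
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    // Record the events on the same stream as the launch so they bracket the
    // GPU work, not just the (asynchronous) host-side API call.
    cudaEventRecord( start, stream );
    optixLaunch( pipeline, stream, d_params, paramsSize, &sbt, width, height, 1 );
    cudaEventRecord( stop, stream );

    // Block only on the stop event, then read the elapsed GPU time.
    cudaEventSynchronize( stop );
    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    return ms;
}
```

Dividing the ray count from the counting run by the seconds from the timing run then gives the final number, e.g. rayCount / ( timedLaunch( ... ) * 1.0e-3 ).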

I hope that helps, let me know if anything in there doesn’t make sense, or if that brings up more questions.


David.
