Task scheduling in OptiX 7

Hi,

I am currently investigating whether a path tracer using OptiX 7 can benefit from techniques like regeneration (Path Regeneration for Interactive Path Tracing) or stream compaction (https://dl.acm.org/doi/10.1145/2018323.2018330). These techniques address the problem of threads sitting idle in a warp when each thread computes a single path. For example, when Russian Roulette is used, not all paths have the same length. This results in warps containing both idle threads (done with their path) and active threads (not yet done with their path). The idle threads waste resources.

The OptiX Programming Guide states:

For efficiency and coherence, the NVIDIA OptiX 7 runtime—unlike CUDA kernels—allows the execution of one task, such as a single ray, to be moved at any point in time to a different lane, warp or streaming multiprocessor (SM). […] Consequently, applications cannot use shared memory, synchronization, barriers, or other SM-thread-specific programming constructs in their programs supplied to OptiX.

The techniques I mention are designed for architectures that schedule entire warps at a time. Reading the statement from the programming guide, it seems to me that OptiX 7 may schedule threads independently. If this is the case, do I correctly understand that the problem the techniques address is solved by OptiX 7? And is there perhaps a resource that describes the scheduling of threads/warps in OptiX 7 in more detail?

Thanks in advance,

Nol

OptiX uses a single ray programming model.

And is there perhaps a resource that describes the scheduling of threads/warps in OptiX 7 in more detail?

No. The scheduling is completely internal and confidential and it changes among OptiX versions or even among GPU architectures.

You can do whatever you want with native CUDA in between OptiX launches, but once you have launched, there is nothing you can do about the scheduling of individual rays.

Reading the statement from the programming guide, it seems to me that OptiX 7 may schedule threads independently. If this is the case, do I correctly understand that the problem the techniques address is solved by OptiX 7?

The underlying ray tracing driver component tries its best.

The paper you cited is 10 years old and only considered streaming multiprocessors. It still applies in that case, but note that with the hardware ray tracing functionality in RTX boards, the number of rays you can shoot is so high (>10 GRays/sec max today) that you will be limited solely by memory accesses in anything you do in between.

Occupancy can still be an issue. Since many light transport algorithms will quickly shoot rather divergent rays, there are potential optimizations which would try to reduce the divergence.
You could also shoot more rays in a specific launch index when you find that the path length is below some threshold (path depth or time) to reduce the average path length per launch index. But that would require additional management for the number of samples, or a pool of more rays you can pull from along with the necessary scattered writes into the result buffer.
In any case, you can only do that within the boundaries of the single ray programming model. Nothing is known about neighboring rays. The OptiX programming guide is perfectly clear about that.

You might be interested in Jacco Bikker’s work.

His open source Lighthouse2 implements a wavefront path tracer using OptiX 7.2.

Thanks for your reply, that mostly clears it up for me. I have one question about what you say here:

You could also shoot more rays in a specific launch index when you find that the path length is below some threshold (path depth or time) to reduce the average path length per launch index. But that would require additional management for the number of samples, or a pool of more rays you can pull from along with the necessary scattered writes into the result buffer.

Do you mean by this that it would be beneficial if all launch indices take the same amount of time or cast the same number of rays? If so, it is not clear to me why this is beneficial with a single ray programming model.

That addresses the issue of paths with long tails inside a path tracer. Since everything is executed in warps under the hood, keeping threads working increases the occupancy.

Let's say you have a long path which terminates after 6 bounces and another which terminates after 3. The latter could then shoot another path of a different sub-frame, which might also take only 3 bounces, and would therefore keep the thread occupied longer.
That thread would have calculated two samples per pixel where the thread with the longer path calculated only one. To get to a consistent sample-per-pixel count you would need to track that count as well and somehow fill in the pixels which are under-sampled. That in turn means more memory accesses, which are the enemy of performance.
The issue is that the decision to shoot another path has to be based on some global heuristic or on the clocks spent so far, since, as said, no information is available about neighboring rays.
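The 6-bounce versus 3-bounce example can be worked through with a little arithmetic. This is only a sketch in plain Python, not OptiX device code: it assumes two lanes that run in lockstep until the slower one finishes, and the helper name `lane_busy_fraction` is made up for illustration.

```python
def lane_busy_fraction(paths_per_lane):
    """Fraction of lane-steps spent on useful work.

    paths_per_lane[i] lists the bounce counts of the paths lane i traces
    back to back. The lanes are assumed to stay allocated until the
    busiest lane is done (a simplification of warp execution).
    """
    work = [sum(paths) for paths in paths_per_lane]
    return sum(work) / (max(work) * len(work))

# One path per lane: the 3-bounce lane idles for half of the time.
without_regeneration = lane_busy_fraction([[6], [3]])

# The short lane starts a second 3-bounce path of another sub-frame.
with_regeneration = lane_busy_fraction([[6], [3, 3]])

# The sample counts per lane are now uneven and have to be tracked.
samples_per_lane = [len(paths) for paths in [[6], [3, 3]]]

print(without_regeneration, with_regeneration, samples_per_lane)
```

The second lane ends up with two samples where the first has one, which is exactly the bookkeeping cost mentioned above.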

Another approach would be to pick new rays for other pixels from a pool. For example, when rendering a 3840x2160 image you would not launch with the full size but with a smaller 2D launch, e.g. a quarter of it, and each thread would pick a ray from the list of remaining rays as soon as it is done with one path. That would require atomics. Threads with shorter paths would pick more rays.
(I doubt that will be faster than just launching with the full size and letting the implementation schedule that internally.)
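A rough host-side simulation of the pool idea, to show how shorter paths end up pulling more work. It is a sketch under simplifying assumptions: lanes advance in lockstep, each path length is at least 1, and the plain integer `next_index` stands in for a global counter that device code would bump with an atomic add. The function name `simulate_pool` is hypothetical.

```python
def simulate_pool(path_lengths, num_lanes):
    """Return (paths taken per lane, lockstep steps until the pool is drained)."""
    next_index = 0                  # stand-in for an atomically incremented counter
    remaining = [0] * num_lanes     # path segments left on each lane
    taken = [0] * num_lanes         # how many paths each lane has pulled
    steps = 0
    while True:
        # Every idle lane claims the next unhandled path, if there is one.
        for lane in range(num_lanes):
            if remaining[lane] == 0 and next_index < len(path_lengths):
                remaining[lane] = path_lengths[next_index]
                next_index += 1
                taken[lane] += 1
        if next_index >= len(path_lengths) and all(r == 0 for r in remaining):
            break
        # All busy lanes trace one segment.
        for lane in range(num_lanes):
            if remaining[lane] > 0:
                remaining[lane] -= 1
        steps += 1
    return taken, steps

taken, steps = simulate_pool([6, 3, 3], num_lanes=2)
print(taken, steps)
```

Whether this beats a full-size launch on real hardware is a separate question; the simulation ignores the cost of the atomics and of the scattered writes to the result buffer.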

The described wavefront approach is slightly different. There you would also shoot fewer rays per launch and depending on the live state of the ray, you would either keep launching with the additional path segment or replace it with another not yet handled path.
That would require an analysis step between each launch of the wavefront which will exactly run into the memory bandwidth limits I described earlier.
The question is always "Is it worth it?", and especially on RTX boards you need to keep the RT cores busy, which works better when you use the shader pipeline and do not stop after each wavefront.
(When reading that article, note that OptiX Prime has been discontinued in OptiX 7 for that reason.)
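The wavefront loop described above can be sketched on the host as follows. This is an illustration only: each loop iteration stands for one launch that traces a single segment per active path, the list filtering stands for the analysis/compaction step between launches, and path lengths are assumed to be at least 1. The name `run_wavefront` is hypothetical.

```python
def run_wavefront(path_lengths, wave_size):
    """Return the number of active paths in each simulated launch."""
    pending = list(range(len(path_lengths)))   # paths not started yet
    active = []                                # (path_id, segments_left)
    launch_sizes = []
    while pending or active:
        # Replace terminated paths with paths that have not been handled yet.
        while len(active) < wave_size and pending:
            path_id = pending.pop(0)
            active.append((path_id, path_lengths[path_id]))
        launch_sizes.append(len(active))
        # One launch: every active path traces one more segment.
        traced = [(path_id, left - 1) for path_id, left in active]
        # Analysis step between launches: keep only the paths still alive.
        active = [(path_id, left) for path_id, left in traced if left > 0]
    return launch_sizes

launch_sizes = run_wavefront([6, 3, 3], wave_size=2)
print(launch_sizes)
```

On a GPU the refill and compaction steps read and write the whole path state in memory on every iteration, which is where the bandwidth cost mentioned above comes from.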

Another approach is to shoot all samples per pixel at once and work over the image in tiles, like many final frame renderers do. That improves the locality of the paths per pixel, which means the path lengths should be relatively similar within a tile.
In practice, my experiments showed that RTX boards handled full screen launches similarly well though.
That means the underlying scheduler is pretty sophisticated already, so all of this would need to be determined on a case by case basis.
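For the tiled variant, the host side only has to enumerate tile rectangles and launch once per tile. A minimal sketch of that enumeration, assuming square tiles and a hypothetical tile size of 512 pixels; the generator name `tiles` is made up, and the actual launch call is left out.

```python
def tiles(width, height, tile_size):
    """Yield (x, y, tile_width, tile_height) rectangles covering the image."""
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            # Tiles at the right and bottom borders are clipped to the image.
            yield x, y, min(tile_size, width - x), min(tile_size, height - y)

tile_list = list(tiles(3840, 2160, 512))
print(len(tile_list), tile_list[-1])
```

Each rectangle would become one launch with the tile dimensions as launch size and the tile origin passed through the launch parameters.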

Divergence hurts quite a lot and it makes sense to sort rays into buckets with similar directions, for example, into octants based on their direction component signs.
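Bucketing by octant only needs the sign bits of the direction components. A small sketch, assuming directions are given as `(x, y, z)` tuples; the bit layout (x in bit 0, y in bit 1, z in bit 2) and the function names are arbitrary choices for illustration.

```python
def octant(direction):
    """Map a direction to an octant index 0..7 from its component signs."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)

def bucket_rays(directions):
    """Group ray indices into 8 buckets by the octant of their direction."""
    buckets = [[] for _ in range(8)]
    for ray_index, direction in enumerate(directions):
        buckets[octant(direction)].append(ray_index)
    return buckets

buckets = bucket_rays([(1.0, 1.0, 1.0), (-1.0, 2.0, -3.0), (0.5, 0.1, 0.2)])
print(buckets)
```

Rays in the same bucket can then be placed next to each other in the launch so that neighboring threads trace in similar directions.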

Mind that you cannot do an arbitrary amount of work per launch, since under Windows GPUs in WDDM mode are subject to a 2-second Timeout Detection and Recovery (TDR) mechanism. That is not the case on GPUs dedicated to compute work which run in the Tesla Compute Cluster (TCC) driver mode (not available on GeForce).

Excellent answer, thank you for taking the time to explain this.