Warps in Ray gen and how data is initially stored in the memory hierarchy

Hi! I have two questions regarding how the various OptiX programs are actually executed on the underlying GPU.

First, can I assume that each element launched through optixLaunch will be a separate thread running on an SM? If so, can I assume that the first 32 rays obtained through optixGetLaunchIndex() in the ray gen program will be in the first warp, and the next 32 rays obtained through optixGetLaunchIndex() will be in the second warp, and so on?

Second, what memory is used to store the launch parameters, and what memory is used to store the ray payload? I feel that OptiX abstracts the memory hierarchy away from the programmers and so optimizing for memory efficiency requires a bit of guessing where the data resides. Is there a reference that explicitly states what data stays in what memory?

Hey good questions.

So regarding optixLaunch(), every invocation of your raygen program will be a separate thread. When you specify the width and height of your 2D OptiX launch, you can expect the number of threads to be width * height. Your call to optixGetLaunchIndex() gives you an index that identifies the thread.

https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#launch-index

Note the last sentence in that section: “program execution of neighboring launch indices is not necessarily done within the same warp or block, so the application must not rely on the locality of launch indices.”

So the answer to the first part of your first question is “yes”, and to the second part of the first question, “no”. You should not assume that the first 32 rays are grouped together into the first warp. It’s hard to define what “first” means, and threads and warps in general do not execute in sequential order. OptiX automatically structures a 2D launch into tiles for efficiency, so your sequential thread ids will usually not be in scan-line order. On top of that, OptiX reserves the right to move threads during execution: “For efficiency and coherence, the NVIDIA OptiX 7 runtime—unlike CUDA kernels—allows the execution of one task, such as a single ray, to be moved at any point in time to a different lane, warp or streaming multiprocessor (SM). (See section “Kernel Focus” in the CUDA Toolkit Documentation.) Consequently, applications cannot use shared memory, synchronization, barriers, or other SM-thread-specific programming constructs in their programs supplied to OptiX.” https://raytracing-docs.nvidia.com/optix7/guide/index.html#introduction#overview

Some additional reading on raygen and threads here: https://raytracing-docs.nvidia.com/optix7/guide/index.html#ray_generation_launches#ray-generation-launches

Launch parameters are in device memory. Currently launch params are put into constant (read-only) memory for efficiency, and the launch params buffer is limited to a maximum size of 64KB.
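As a rough illustration of the 64KB limit mentioned above, a compile-time check on the host side can catch an oversized launch-params struct early. This is only a sketch: the struct name LaunchParams, its fields, and the constant kMaxLaunchParamsBytes are hypothetical, and the 64-bit handle field merely stands in for an OptixTraversableHandle. Large data would be referenced through device pointers rather than embedded in the struct.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical launch-params layout (all names are illustrative).
struct LaunchParams {
    std::uint32_t width;        // launch width in pixels
    std::uint32_t height;       // launch height in pixels
    float*        frameBuffer;  // pointer to large data, which lives outside the params block
    std::uint64_t traversable;  // stand-in for an OptixTraversableHandle
};

// Limit quoted in the post above: the launch params buffer may not exceed 64KB.
constexpr std::size_t kMaxLaunchParamsBytes = 64 * 1024;

static_assert(sizeof(LaunchParams) <= kMaxLaunchParamsBytes,
              "launch params must fit in the 64KB constant-memory budget");
```

Keeping only small scalars and pointers in the struct leaves plenty of headroom under the limit.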

Payload values are generally compiled into registers. If you need more space for a payload than the limited number of payload slots, you can put a pointer to memory in the payload. That usually comes with the associated indirection and memory access costs.
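To make the "pointer in the payload" idea concrete, here is a minimal sketch of splitting a 64-bit pointer into two 32-bit values, one per payload slot, and reassembling it. It is written as plain C++ so it can be compiled on its own. The helper names packPointer/unpackPointer and the PerRayData struct are illustrative assumptions. In device code the two halves would be passed as payload arguments to optixTrace() and read back with optixGetPayload_0() and optixGetPayload_1().

```cpp
#include <cstdint>

// Hypothetical per-ray record that is too large for the payload slots.
struct PerRayData {
    float radiance[3];
    int   depth;
};

// Split a pointer into two 32-bit halves, one per payload slot.
inline void packPointer(void* ptr, std::uint32_t& hi, std::uint32_t& lo)
{
    const std::uint64_t bits = reinterpret_cast<std::uintptr_t>(ptr);
    hi = static_cast<std::uint32_t>(bits >> 32);
    lo = static_cast<std::uint32_t>(bits & 0xffffffffull);
}

// Reassemble the pointer from the two payload halves.
inline void* unpackPointer(std::uint32_t hi, std::uint32_t lo)
{
    const std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(bits));
}
```

Every access through the unpacked pointer is a real memory access, which is the indirection cost mentioned above.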

The Programming Guide does mention both of these in different sections, but it’s easier for me to find them quickly when I know what I’m looking for. It’s true that OptiX is abstracting this away a bit, but for good reason - in general OptiX is putting these things in the most efficient-to-access place possible.

Launch Params in constant mem: https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#7054

Payloads in registers: https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#trace


David.

Thanks! So if we can’t rely on the fact that threads with neighboring indices are in the same warp, do you have any suggestions how we as programmers could help minimize control-flow divergence and improve locality? I get that the OptiX runtime will do some sort of ray grouping/bundling to reduce control-flow divergence and maximize locality, as in any ray-tracing engine, but does that mean programmers can’t really help here?

I am asking this because empirically I find that if I manually reorder the ray indices in a locality-friendly way, I get much higher performance than using a random order. So I guess that suggests that the run-time system has its limits in doing ray scheduling and programmers might help, but if programmers can’t rely on certain deterministic facts, rigorous optimizations become harder.

This is also a good question.

How are you thinking of reordering the ray indices? Do you mean camera rays, or reflection rays? How do you find the locality-friendly ordering?

If by reordering, you are talking about raygen, then OptiX is not stopping you from re-ordering your thread’s work at the beginning of raygen; you can map your launch index to camera rays any way you like. You just can’t currently re-order threads during a launch once they’re in progress.
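As one example of a locality-friendly mapping done at the start of raygen, a 1D launch index can be decoded along a Morton (Z-order) curve so that consecutive indices land on nearby pixels. This is only a sketch of one possible remapping, not something OptiX does or requires. The names compactBits, mortonToPixel and Pixel are hypothetical, and the decode assumes a square, power-of-two image of at most 65536 x 65536 pixels.

```cpp
#include <cstdint>

struct Pixel {
    std::uint32_t x;
    std::uint32_t y;
};

// Gather the even-positioned bits of v (bits 0, 2, 4, ...) into the low 16 bits.
inline std::uint32_t compactBits(std::uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0f0f0f0fu;
    v = (v | (v >> 4)) & 0x00ff00ffu;
    v = (v | (v >> 8)) & 0x0000ffffu;
    return v;
}

// Decode a linear index into a pixel on a Z-order curve:
// even bits of the index form x, odd bits form y.
inline Pixel mortonToPixel(std::uint32_t linearIndex)
{
    return { compactBits(linearIndex), compactBits(linearIndex >> 1) };
}
```

With this mapping, indices 0 through 3 cover a 2x2 block and indices 0 through 15 cover a 4x4 block, so any aligned run of consecutive indices stays spatially compact.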

And just to be pedantic but careful - threads and rays in general are not 1:1. Raygen is called once per thread, while closest-hit is called at most once per ray. So reordering rays and reordering threads are fairly different activities.

For hit shader coherence, there are some things you can do as a programmer. At a high level, a couple of common basic ray tracing architectures are uber-shader and wavefront. Both of these options afford you chances to reduce divergence and improve locality. With an uber-shader, the design of your material system and available materials has a big impact on how much divergence occurs in hit programs. If you can combine common sub-sections in your materials, or reduce the total number of materials, you can reduce divergence. With a wavefront approach, you might stop your launch after each path segment of a path tracer, and then sort your launch indices by material in order to regain coherence, then start a new launch to shade and trace further into the scene.


David.

I meant camera/primary rays. Basically I map launch index to camera rays such that camera rays with neighboring indices are also spatially close. I am doing something weird so camera rays with neighboring indices are not automatically in scan-line order. What I found is that manually ensuring that rays with neighboring indices are spatially close significantly improves performance.

I think I get that “threads and rays in general are not 1:1”, but I’m not sure how to understand your explanation. Are CH, AH, IS programs executed within the same thread that executes the corresponding raygen program? Also, if in, say, a CH program I call another optixTrace, that secondary ray will not spawn a new thread, right?

Thanks for the suggestions on shader coherence. I am aware of the general techniques. Do you know of a good CUDA implementation of sorting launch indices according to materials?

I meant camera/primary rays. Basically I map launch index to camera rays such that camera rays with neighboring indices are also spatially close. I am doing something weird so camera rays with neighboring indices are not automatically in scan-line order. What I found is that manually ensuring that rays with neighboring indices are spatially close significantly improves performance.

Ah, okay, so yes you can continue to do that. Just be aware that OptiX is already doing it too, for exactly the same reason. The default right now is 8x4 tiles. If you have a different arrangement you need, you might have to undo the OptiX mapping. We can help with that, or it may be fairly easy to reverse engineer. I haven’t thought about this carefully, but maybe it would work to setup your own camera mapping to intentionally put your threads in scanline order, knowing that OptiX will then tile them.

I think I get that “threads and rays in general are not 1:1”, but I’m not sure how to understand your explanation. Are CH, AH, IS programs executed within the same thread that executes the corresponding raygen program? Also, if in, say, a CH program I call another optixTrace, that secondary ray will not spawn a new thread, right?

Sorry I didn’t clarify. Rays are only spawned via optixTrace() calls. The ray traversal and any associated programs attached to the ray are called within the scope of the same thread as the raygen thread that spawned the ray(s). There is never a case where new threads are spawned. So if you only trace a single primary ray in raygen, then your rays and threads are 1:1. If you trace a path of several ray segments, or trace recursively in CH, or super-sample your pixel, then you’ll have multiple rays per thread.

Do you know of a good CUDA implementation of sorting launch indices according to materials?

I would start with Thrust or Cub and see if one of those gives you everything you need. Thrust is higher level and more generic. Cub’s a little lower level and more CUDA specific.

Thrust :: CUDA Toolkit Documentation

CUB :: CUDA Toolkit Documentation
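To show the shape of the "sort launch indices by material" step between wavefront launches, here is a CPU-only sketch using the C++ standard library. It is an assumption-laden stand-in: on the GPU the same key/value sort would typically be done with thrust::sort_by_key or cub::DeviceRadixSort::SortPairs, using material ids as keys and path indices as values. The function name sortPathsByMaterial is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Given the material id hit by each path, return the path indices reordered
// so that paths that hit the same material are adjacent.
inline std::vector<std::uint32_t> sortPathsByMaterial(
    const std::vector<std::uint32_t>& materialIds)
{
    std::vector<std::uint32_t> order(materialIds.size());
    std::iota(order.begin(), order.end(), 0u);  // 0, 1, 2, ...

    // Stable sort keeps the original relative order within each material.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::uint32_t a, std::uint32_t b) {
                         return materialIds[a] < materialIds[b];
                     });
    return order;
}
```

The next launch would then look up its path through this ordering, so that threads with neighboring launch indices shade the same material.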


David.

Thanks for your response, David.

Am I correct in understanding that if I launch a 2D grid then the first 8x4 indices will be grouped together and the next 8x4 indices will be in the second group, and so on? And when I call optixGetLaunchIndex(), if I get 31, that’s the last element in the first tile (4th element in the 8th row in my launch grid), not the 32nd element in the first row in my launch grid, right?

Does it mean that each 8x4 tile is a warp? I can’t help but relate a tile to a warp because 8x4 = 32. My guess is that initially they will be in the same warp, but later the runtime might decide to regroup threads into new warps for efficiency reasons?

Also, what happens to a 1D or a 3D launch grid? Any tiling being done there?

Yes, the 8x4 tiles are grouping threads into warps. The intent is to bundle the primary rays closer together so they’re more likely to traverse the same parts of the scene and less likely to hit different geometries & materials. This is assuming that you’re mapping your launch index into a camera ray by using a straightforward linear mapping of launch indices to camera space (u,v) coordinates, for example, such that a given launch_index.y value corresponds to a scan-line of pixels. And yes, OptiX reserves the right to regroup the threads.
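For intuition only, the following sketch models how a sequential thread id could map to a 2D launch index under 8x4 tiling. The actual OptiX mapping is undocumented and may differ or change. This model assumes tiles 8 wide and 4 tall, row-major order inside each tile, tiles laid out row-major across the image, and a width that is a multiple of 8. The names tiledIndex, Index2D, kTileW and kTileH are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint32_t kTileW = 8;  // assumed tile width
constexpr std::uint32_t kTileH = 4;  // assumed tile height

struct Index2D {
    std::uint32_t x;
    std::uint32_t y;
};

// Map a sequential thread id to a 2D index under the assumed tiling.
inline Index2D tiledIndex(std::uint32_t linearId, std::uint32_t width)
{
    const std::uint32_t tilesPerRow = width / kTileW;  // assumes width % 8 == 0
    const std::uint32_t tile   = linearId / (kTileW * kTileH);
    const std::uint32_t inTile = linearId % (kTileW * kTileH);

    const std::uint32_t x = (tile % tilesPerRow) * kTileW + inTile % kTileW;
    const std::uint32_t y = (tile / tilesPerRow) * kTileH + inTile / kTileW;
    return { x, y };
}
```

Under these assumptions, sequential id 31 lands at (7, 3), the last pixel of the first tile, and id 32 starts the next tile at (8, 0) rather than continuing along the first scan-line.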

Great question, I think the 1D launches are not tiling, and I’m not sure about 3D, I would have to check to make sure. I do think it’s instructive to figure out how to see the sequential thread ordering on-screen, I’d recommend going through that exercise. (Hint: it’s slightly harder than it sounds because you don’t get a scalar thread id when your launch index is 2D or 3D, and so the sequential mapping can happen transparently. In the past I started by returning a solid identifiable color from raygen for only a few far-apart groups of 32 threads.) This would help you verify and visualize your primary ray coherence, and if needed it might help you design any modification you need.


David.

Thank you. Is there a reference/citation for 8x4 tiling and that a tile is a warp (initially) ?

At the moment the tiling scheme is not mentioned in the programming guide mainly because there are no programming choices to make; there is no API and it happens transparently. Also, we might change it if we find higher performance alternatives, so for that reason and all the other reasons we’ve discussed, it’s best to stick to the ‘single-ray programming model’ and not rely on the shape of a warp or make any assumptions about what two neighboring threads are doing, or even that they’re neighbors as far as the GPU is concerned.

We have talked about the tiling openly here on the forum and directly with some customers, but as far as I know, nobody has ended up needing to change it or remap it or work around it and I’m not aware of any cases of it leading to slower performance. If you do end up needing to do something different, we’d love to hear about it.


David.

Following up on this thread on memory hierarchy, in traditional CUDA the compiler would try to store local variables in registers so that we can iterate over them without going to the global memory. In OptiX, if each ray in the IS program needs to update a small amount of shared data, it seems the only way to do that is to pass the shared data as the ray payload? Then my understanding is that this limits the shared data to 8 32-bit values, which will be stored in registers, and for larger data we would have to store it in global memory and pass pointers as the ray payload, correct?

Any suggestions as to shared data across rays in the IS program? Thanks!

For IS (intersection) programs, you have payload values to communicate inputs to the intersector (via registers, usually), and you have attribute values to output data from the intersector (via registers, usually). Attributes are set in the optixReportIntersection() call, and retrieved via optixGetAttribute_n() where n is [0…7]. You can write to payload slots, but you might want to prefer using attributes (it’s cleaner, and depth order hits are handled for you when reading attributes via closest-hit).
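Since attributes and payload slots are 32-bit unsigned values, floating-point data has to be reinterpreted bit-for-bit when it passes through them. The sketch below shows that reinterpretation in plain C++. In device code the CUDA intrinsics __float_as_uint() and __uint_as_float() play the same role. The helper names floatAsUint and uintAsFloat are illustrative, and the code assumes 32-bit IEEE 754 floats.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned value (no numeric conversion).
inline std::uint32_t floatAsUint(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Inverse of floatAsUint: recover the float from its bit pattern.
inline float uintAsFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

An intersection program would encode, say, a hit normal component this way before reporting it, and the hit program would decode the value returned by optixGetAttribute_0().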

For accessing memory, there are several ways to pass pointers into IS programs. You can pass buffer pointers via the payload, or you can access your SBT entry directly in the IS program, or you can access launch params.

What kind of shared data do you want to update in intersection? And what does ‘shared’ mean exactly? Shared with all rays in the same thread (e.g. all rays cast in raygen), or shared across all threads in the launch, or some other granularity?

Keep in mind if you want to update shared data based on hits and not misses, that can (and possibly should) be done in closest-hit or any-hit, since IS programs are called more frequently than hit programs.


David.

Oh sorry, I should have been clearer. I am talking about sharing data across different IS program invocations of the same camera ray. That is, there is a piece of shared data that all the intersected bounding boxes need to read and write, and that shared data is about 200KB for each camera ray.