Returning dynamic sized results. Not sure how to go about this

tjaenichen · July 16, 2021, 3:31pm

I can go into more detail if needed, but basically we have a launch index of (index of position to shoot from, index into list of directions to shoot). This then sets one value in the return buffer per position.

However the cast is 2 step process. After an initial ray cast a lot of the rays are not followed anymore, yet we need to re run this analysis multiple times.

What I am now looking for is a way to cache the valid position/direction information… and that’s where I am stumped.

A naive approach was an int for each possible combination. This would result in gigs of memory used, memory we likely don’t have (well, I don’t have/want to use on my dev laptop… it just sounds like a horrible idea)

To me the obvious approach would be a dynamic array and just add valid combinations to that. My understanding is that his isn’t possible. Perfectly reasonable, just a bummer in my case.

So, now I am not sure how to go about this. Somewhere in there needs to be a dynamic list of position/direction combinations. Maybe return this somehow back to the launching app during ray gen?

But from what I understand that’s not what host functions do.

dhart · July 16, 2021, 4:35pm

Hey interesting question. I’m not sure I understand it exactly so more detail might be necessary, but let’s talk about some general possibilities first.

First of all, you can generate a dynamic array of results, with the unfortunate caveat that you need to pre-allocate the maximum possible run-time size for this array. So you could, if you wanted, allow for, say, 4GB of results, and insert items until you fill it up. To add items to a queue or dynamic array, you would typically use an atomic counter, which will obviously have some performance cost.

One way to handle this would be to let your kernel fill your dynamic buffer until it’s either done or the buffer is full. If the buffer fills up, you can exit all remaining threads without doing any work, and then process your data to reduce it or save it to disk or whatever, and then re-launch your kernel to pick up where it left off. Repeat until done.

Once it’s setup that way, it becomes optional whether you really want to use atomics and a dynamic buffer for output at all. You can also index the buffer using your launch index and let it fill with both valid and invalid results. Then after the launch, sort the results and discard the invalid stuff. This implies breaking up your job into predictable bites, but on the whole is the same as before; process your job in pieces using multiple launches.

Perhaps another option might include using a spatial hash for any random number generation, which might give you the option to recompute or reconstruct some of your data rather than cache it?

Does any of that start to seem reasonable, or am I missing your point & problem completely? I also didn’t understand what you meant at the end about host functions, could you elaborate on that?

–
David.

tjaenichen · July 16, 2021, 5:18pm

Sure, happy to give more details!

Imagine a grid on the floor of a room with a resolution of 30cm. From each grid point we shoot rays in “random” directions (usually 50k rays, always the same for each origin). If the “primary” ray hits a window, we cast a new/“secondary” ray outside and then report back if after that it hits the building or nothing/the sky. Those hits are then atomicAdded for each grid point as the final result. Currently we’re not just doing this with a single room, but whole city blocks or even burroughs.

Ideally we can edit buildings outside of the rooms in real time and get instant feedback (or close to that)

In general it works, it just takes a while, too long for live editing.

If we can keep the connection between a secondary ray (the one that checks for the sky) and the origin we just need to check those secondary when there are changes. Also we can cull a lot of the primary rays to begin with, once we know which ones hit a window in subsequent calls.

I thought of this, but I only need a fraction of that in almost all cases (I just can’t really make a good guess what the ratio is). It just feels wrong to allocate 4gb just for that. Or maybe I am being unnecessarily stingy here? (Also for a few more weeks VRAM is a limiting factor)

I didn’t think of aborting launches. I checked the API a few more times but that didn’t come up. How can I manually abort a launch?

Yeah, that was one way I thought about. However ideally that information wouldn’t have to be send back to the host at this stage.

I’d have one launch to fill a buffer with all the valid/window hitting “primary” rays and then just launch against this set my secondary rays and report those results back.

I was thinking that another option might be to have a callback for each successful primary ray, that then fills a Vector on the host, which I then feed into my secondary launches.

Edit: Took me a second to get where you’re going with the abort. Sounds like a great idea!

dhart · July 16, 2021, 6:16pm

Okay, it’s starting to be more clear. So kind of a big “bake” of the scene into grid points? Kind of like an irradiance volume, perhaps, but on the surfaces? And you want to accelerate a recompute of the data once you’ve already done it. Okay, here are some (probably stupid) ideas, see if any of this helps at all:

Would it be possible to build a per-room list of windows (and/or all portals to areas outside the room, including maybe doorways)? I wonder if it could help to estimate the total number of primary rays that will lead to secondary rays, before the render launch. If you have a list of these “portals” in polygonal form, you can compute the total solid angle of the portals from any given origin and know in advance roughly how many primary rays will leave the room and turn into secondaries. You could then be ready to pre-allocate your “dynamic” array of secondary results by using the primary ray estimate with a nice safety margin, let’s say 25% extra space for example. I’d assume your explicit list of directions is a pre-sampled hemisphere or something? All uniform? At least this would let you know you might only need space for 1k secondaries, rather than 50k of them.

If your list of 50k directions is something you put in memory and re-use for each origin, something to ponder is whether re-seeding a random number generator, or hashing based on index, and recomputing the directions on the fly might help, in order to avoid the memory bandwidth of reading from the directions buffer.

Another option is to compute primary and secondary results separately, and cache the primary results in advance. 50k rays squeezed into a bit vector is 6.25kb per origin uncompressed, where a “1” bit could represent any primary that results in a valid secondary. If you save these bits, later you could recast only the primary rays that are associated with a secondary ray that needs to be recomputed. Recasting the primary ray frees you from needing to save the hit point, and might even be as fast as saving information about the hit point into memory.

There isn’t any API for aborting launches, if you want to try that. In general, I would advise that aborting a launch should be used as a last resort, only to be used if you can’t find any other way to break your job into manageable pieces using predictable indexing. There are several ways to abort launches. The one I was thinking of when I commented above is to use your atomic counter to know when your output buffer is full. Once that happens, threads can check the counter and simply exit without doing any work. This means your remaining threads will need time to vacate. This is reasonable if you don’t expect to have the vast majority of threads needing to exit, but won’t be a great option if you launch many times more threads than needed. Another way that could work but may have complications is to use volatile pinned zero-copy memory to communicate between the GPU and host during kernel execution. This can be tricky but some people use this to abort kernels interactively. With this method, you have to use the volatile pinned memory judiciously as using it is very slow compared to normal memory loads & stores.

As far as communicating to the host goes, it is reasonable to send data back to the host. You just generally want to prefer allocating and storing the buffer on the GPU first, and then later copy it back to the host. It’s reasonable and common to do a reduction kernel of some sort on the GPU before either sending data to the host, or before using the result as input to a subsequent kernel. Some of what you’re talking about leads me to imagine several different kinds of kernels that run back-to-back processing the results of the ray tracing. See if thinking about CUDA kernels that process the results of the OptiX launch opens up some new conceptual possibilities; there are a lot of options once you start to think about doing a reduction pass on the GPU separate from the render pass, since that makes it much easier to resize and reshape your ray tracing results.

–
David.

tjaenichen · July 16, 2021, 6:30pm

Thanks David!

I just remember now from testing that at least for what we do what is costly is the trace, not the launch itself. (but will test this again with a full set) So I am going to try out the launch/abort/relaunch pattern aiming for a large enough buffer. It sort of flipped around how I look at the problem.

I tried the solid angle route, but there was always some corner case where this created more problems than it was worth it.

The rays are uniform, so not actually random. I was thinking about bit fields, but thought that it would only reduce the memory load so much, while creating all this overhead with the bit operations.

As for the multiple kernels… I need to think about that.

Again, thanks a lot. This really helped :)

Thorsten

dhart · July 16, 2021, 6:45pm

Ah, uniform rays might be even cheaper to generate on the fly than random.

So FWIW with bit fields and multiple kernels, I was kind-of imagining these things going together. One option is to write initial results into a bigger buffer without using bit twiddling, then use a reduction kernel to pack the bits and compress them however you see fit, and when you go to re-render, you can extract the bits you need to connect primary rays and secondary rays. The benefit of what I’m talking about is avoiding bit twiddling during the writing phase of your first render / OptiX launch, where it would have the biggest impact and possibly require atomics. The reduction phase would be a gather so would avoid atomics, and the re-render phase would only need to unpack the bits you need, which is a few extra instructions but by virtue of being packed you get the benefits of a higher hit rate from cached memory reads.

Anyway, I do hope my rambling helped a little. Flipping how you look at the problem is kind-of the goal, until it starts to fit into something scalable. I’m slightly worried I made the kernel abort sound too good… do think about possible ways to not need to interrupt a running kernel, I don’t want to send you down a rabbit hole.

–
David.

tjaenichen · July 16, 2021, 7:13pm

Neat! Yeah, I see the multi kernel thing now. Haven’t though about running another intermediate step in between that way.

What I take from the abort thing would be to use a buffer that is large enough most of the time (in prod at least) , with the fall back option of rerunning. (and a smaller buffer for dev, smaller devices)

And as you said, it wouldn’t be an actual abort, just no ops.

tjaenichen · July 16, 2021, 8:08pm

hmm, actually the bit field should be small enough to make all of this moot or at least help a lot using the packing…

Thanks! :)