Okay, it’s starting to become clearer. So it’s kind of a big “bake” of the scene into grid points? Something like an irradiance volume, perhaps, but on the surfaces? And you want to accelerate a recompute of the data once you’ve already done it. Okay, here are some (probably stupid) ideas; see if any of this helps at all:
Would it be possible to build a per-room list of windows (and/or all portals to areas outside the room, including doorways)? It might help to estimate, before the render launch, the total number of primary rays that will lead to secondary rays. If you have these “portals” in polygonal form, you can compute their total solid angle from any given origin and know in advance roughly how many primary rays will leave the room and turn into secondaries. You could then pre-allocate your “dynamic” array of secondary results using the primary-ray estimate plus a nice safety margin, say 25% extra space. I’d assume your explicit list of directions is a pre-sampled hemisphere or something, all uniform? At least this would let you know you might only need space for 1k secondaries rather than 50k of them.
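To make the solid-angle idea concrete, here’s a rough host-side sketch, assuming the portals are triangulated and the direction set is uniform over the hemisphere. The names (`Vec3`, `triangleSolidAngle`, `estimateSecondaryCount`) and the 25% margin baked into the code are all just illustration, not anything from your codebase:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal vector type for the sketch.
struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static double len(Vec3 a) { return std::sqrt(dot(a, a)); }
static double det3(Vec3 a, Vec3 b, Vec3 c) {
    return a.x * (b.y * c.z - b.z * c.y)
         - a.y * (b.x * c.z - b.z * c.x)
         + a.z * (b.x * c.y - b.y * c.x);
}

static const double kPi = 3.14159265358979323846;

// Solid angle subtended by triangle (p0, p1, p2) as seen from `origin`,
// via the Van Oosterom & Strackee formula.
double triangleSolidAngle(Vec3 origin, Vec3 p0, Vec3 p1, Vec3 p2) {
    Vec3 a = sub(p0, origin), b = sub(p1, origin), c = sub(p2, origin);
    double la = len(a), lb = len(b), lc = len(c);
    double num = std::fabs(det3(a, b, c));
    double den = la * lb * lc
               + dot(a, b) * lc + dot(b, c) * la + dot(c, a) * lb;
    return 2.0 * std::atan2(num, den);
}

// Estimate how many of `numPrimary` uniformly distributed hemisphere rays
// will pass through the portals, with a 25% safety margin on top.
size_t estimateSecondaryCount(Vec3 origin,
                              const std::vector<std::array<Vec3, 3>>& portalTris,
                              size_t numPrimary) {
    double omega = 0.0;
    for (const auto& t : portalTris)
        omega += triangleSolidAngle(origin, t[0], t[1], t[2]);
    double fraction = omega / (2.0 * kPi);  // hemisphere covers 2*pi steradians
    return static_cast<size_t>(numPrimary * fraction * 1.25);
}
```

This ignores occlusion between the origin and the portal, so it over-estimates, which is the safe direction for a pre-allocation.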
If your list of 50k directions is something you keep in memory and re-use for each origin, it’s worth pondering whether recomputing the directions on the fly, by re-seeding a random number generator or hashing based on the ray index, might help you avoid the memory bandwidth of reading from the directions buffer.
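Here’s one way that could look, as a sketch: a PCG-style integer hash (a common GPU-friendly choice) turns the ray index into a uniform hemisphere direction with no buffer read at all. The hash constants are the standard published ones; the function names and the XOR used to decorrelate the two random values are my own invention:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// PCG-style integer hash: maps an index to a well-scrambled 32-bit value.
uint32_t pcgHash(uint32_t v) {
    uint32_t state = v * 747796405u + 2891336453u;
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// Map a 32-bit hash to a float in [0, 1) using the top 24 bits.
float hashToUnitFloat(uint32_t h) {
    return (h >> 8) * (1.0f / 16777216.0f);
}

// Recompute direction i of the set on the fly: a uniform sample over the
// +Z hemisphere derived purely from the index (no directions buffer).
void hemisphereDirection(uint32_t i, float& x, float& y, float& z) {
    float u1 = hashToUnitFloat(pcgHash(i));
    float u2 = hashToUnitFloat(pcgHash(i ^ 0x9e3779b9u));  // decorrelate
    z = u1;                                                // cos(theta) in [0,1)
    float r = std::sqrt(std::max(0.0f, 1.0f - z * z));
    float phi = 2.0f * 3.14159265f * u2;
    x = r * std::cos(phi);
    y = r * std::sin(phi);
}
```

Because the direction is a pure function of the index, the same index always reproduces the same ray, which you’d need anyway for the selective-recompute idea below. The trade is a handful of ALU instructions per ray against a global memory load, which on current GPUs usually favors the ALU.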
Another option is to compute primary and secondary results separately, and cache the primary results in advance. 50k rays squeezed into a bit vector is about 6.25 KB per origin uncompressed, where a “1” bit could represent any primary that results in a valid secondary. If you save these bits, later you could recast only the primary rays that are associated with a secondary ray that needs to be recomputed. Recasting the primary ray frees you from needing to save the hit point, and might even be as fast as saving information about the hit point into memory.
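The bit-vector cache itself is trivial; a sketch for concreteness (the `PrimaryHitBits` name is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per primary ray: 1 if that primary produced a valid secondary.
struct PrimaryHitBits {
    std::vector<uint32_t> words;

    explicit PrimaryHitBits(size_t numRays)
        : words((numRays + 31) / 32, 0u) {}

    void set(size_t i)         { words[i / 32] |= 1u << (i % 32); }
    bool test(size_t i) const  { return (words[i / 32] >> (i % 32)) & 1u; }
    size_t sizeInBytes() const { return words.size() * sizeof(uint32_t); }
};
```

For 50,000 rays this is 1,563 words, or 6,252 bytes, per origin, and on the device side `set()` would become an `atomicOr` on the word (or a warp-wide `__ballot_sync` if your indexing lines bits up with lanes).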
There isn’t any official API for aborting launches, if you want to try that. In general, I would advise treating an aborted launch as a last resort, used only if you can’t find any other way to break your job into manageable pieces using predictable indexing. There are several ways to do it. The one I was thinking of when I commented above is to use your atomic counter to know when your output buffer is full. Once that happens, threads can check the counter and simply exit without doing any work. This means your remaining threads will still need time to vacate. That’s reasonable if only a minority of threads will need to exit early, but won’t be a great option if you launch many times more threads than needed. Another way that could work, but may have complications, is to use volatile pinned zero-copy memory to communicate between the GPU and host during kernel execution. This can be tricky, but some people use it to abort kernels interactively. With this method, you have to use the volatile pinned memory judiciously, as it is very slow compared to normal memory loads & stores.
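The atomic-counter pattern looks roughly like this. I’ve written it as a host-side C++ analogue so the logic is plain; in the actual raygen or callable program the `fetch_add` would be `atomicAdd` on a device-global counter, and “return false” would just be an early return from the program. All names here are invented:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Host-side analogue of the device pattern: each thread reserves an
// output slot with an atomic add. Once the buffer is full, later threads
// see the counter at/over capacity and exit without doing any work.
struct BoundedOutput {
    std::atomic<size_t> counter{0};
    std::vector<int> results;

    explicit BoundedOutput(size_t capacity) : results(capacity) {}

    // Returns false if the buffer is already full; the caller should
    // bail out immediately instead of tracing its rays.
    bool tryStore(int value) {
        size_t slot = counter.fetch_add(1);
        if (slot >= results.size())
            return false;  // full: skip the work and exit early
        results[slot] = value;
        return true;
    }
};
```

Note the counter deliberately overshoots capacity; that’s harmless, since any slot index past the end just means “exit”. A cheap refinement is to also read the counter once at thread start, before doing any tracing, so full-buffer threads never even begin.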
As far as communicating to the host goes, it is reasonable to send data back. You just generally want to allocate and fill the buffer on the GPU first, and copy it back to the host afterward. It’s reasonable and common to run a reduction kernel of some sort on the GPU before either sending data to the host or using the result as input to a subsequent kernel. Some of what you’re describing leads me to imagine several different kinds of kernels running back-to-back, processing the results of the ray tracing. See if thinking about CUDA kernels that post-process the OptiX launch results opens up some new conceptual possibilities; there are a lot of options once you separate a GPU reduction pass from the render pass, since that makes it much easier to resize and reshape your ray tracing results.
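As one example of such a pass, a compaction step can squeeze the sparse per-ray buffer down to just the valid entries before anything crosses the PCIe bus. On the GPU this would typically be `thrust::copy_if` or a scan-based compaction kernel; below is the CPU equivalent, purely for illustration, with made-up field names:

```cpp
#include <vector>

// Raw per-ray result as the render pass might write it; `valid` marks
// rays that actually produced a secondary hit. (Field names are made up.)
struct RayResult {
    float value;
    bool valid;
};

// The "reduction" pass: compact the sparse per-ray buffer down to just
// the valid entries, so the host copy is small and densely packed.
std::vector<float> compactResults(const std::vector<RayResult>& raw) {
    std::vector<float> dense;
    dense.reserve(raw.size());
    for (const RayResult& r : raw)
        if (r.valid)
            dense.push_back(r.value);
    return dense;
}
```

Combined with the solid-angle estimate above, this is also what lets the earlier pre-allocation stay loose: the render pass can write sparsely, and the compaction pass produces the tight buffer you actually keep.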