Splitting work on multiple launches

I’m using OptiX on a Windows 7 system, and I’m aware of the display driver timeout problem.
Not wanting to use multiple graphics cards or change the operating system, I am limited to the
‘split your launch’ approach.
But what is the correct way to split a single launch?

Using SampleScene/GLUTDisplay to render my images, a single launch approach would look like this:

void Scene::trace( ... ) {
     //other stuff like camera changes
     m_context->launch( entry_point, width, height);
}

Trying to split it up I first did something like this:

void Scene::trace( ... ) {
     //other stuff like camera changes
     //pretend 'launches' is a vector filled with individual regions for each launch
     for (unsigned int i = 0; i < launches.size(); i++) {
          m_context->launch( entry_point, launches[i].width, launches[i].height);
     }
}

When that did not work out (still killed the display driver), I tried something that simply had to work in my mind:

void Scene::trace( ... ) {
     //other stuff like camera change
     m_current_launch = m_current_launch % launches.size();
     m_context->launch( entry_point, launches[m_current_launch].width, launches[m_current_launch].height);
}

Basically this draws one tile after another, which seems to work at first glance but still crashes the display driver when the overall workload is increased. Even decreasing the size of each launch call to an unreasonably small amount does not work (besides the overhead of these launches being ridiculous at that point).
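
For readers following along, here is a minimal sketch of how the 'launches' vector of regions could be filled. The original post does not show this part, so the LaunchRegion type and the makeLaunchRegions() helper are hypothetical names. Since the launch call above only receives a width and a height, the tile offset would presumably also have to reach the ray generation program in some way (for example as a context variable), which is an assumption and not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical tile description; the original post does not show its own type.
struct LaunchRegion {
    unsigned int x, y;          // offset of the tile inside the full image
    unsigned int width, height; // size of this tile's launch
};

// Cut a width x height image into tiles of at most tileSize x tileSize pixels.
// Tiles on the right and bottom edges are clipped to the image bounds.
std::vector<LaunchRegion> makeLaunchRegions(unsigned int width,
                                            unsigned int height,
                                            unsigned int tileSize)
{
    assert(tileSize > 0);
    std::vector<LaunchRegion> regions;
    for (unsigned int y = 0; y < height; y += tileSize) {
        for (unsigned int x = 0; x < width; x += tileSize) {
            regions.push_back({x, y,
                               std::min(tileSize, width - x),
                               std::min(tileSize, height - y)});
        }
    }
    return regions;
}
```

Calling makeLaunchRegions(512, 512, 128) yields 16 tiles that together cover every pixel exactly once.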

To avoid some confusion:

  • said launch takes up 98%+ of processing time each time trace is called
  • launches take more time on well lit tiles (expected)
  • rendering the whole image is possible if overall workload is small
  • current goal is to render the Sponza scene with about 50k VPLs per trace call (does not seem over the top to me)

The crash does not happen right away, but on specific tiles.
Those tiles are very well lit so they might take longer than others, but is there some other reason than a timeout that could cause the display driver to crash (stuff like division by 0 etc)?

Is my approach valid to split the workload?
Would the first approach suffice or do I need the second one?
Am I missing something?
Some people referred to rtContextSetTimeoutCallback(), but I don’t know if that is needed or how it’s used (what is the callback function supposed to do when it’s called?).

What’s your GPU system setup?

for (unsigned int i = 0; i < launches.size())
Is that a copy-paste error or an actual infinite loop in your code?

current goal is to render the Sponza scene with about 50k VPLs per trace call
VPL means Vertical Plane Launch?
The important question is how many rays does that spawn per launch?

How did you set up the scene? Which acceleration structure builder did you use?

Other ways to split work would be to calculate results iteratively. To judge whether that is feasible, I would need to see the ray generation program code and all places that call rtTrace().

To run efficiently, launch sizes should be several times larger than the number of threads the GPU can run concurrently.

For more explanation on the timeout callback, please read this post https://devtalk.nvidia.com/default/topic/759073/?comment=4255248
The callback function should return zero if you want the launch to be continued. See OptiX API Reference on RTtimeoutcallback type.
But as long as you cannot reduce the time it takes to calculate the most expensive single thread to finish below the Windows WDDM timeout, you’ll run into this timeout problem on a board responsible for the display no matter how small your launch size is.
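
To make the callback contract above concrete, here is a minimal sketch of what such a function might look like. The g_abortRequested flag is a hypothetical application-side variable, and the 0.1 second polling interval in the comment is an arbitrary assumption. The registration call is left as a comment so the sketch compiles without the OptiX headers. Note that this only lets the application abort a launch; as stated above, it does not make an expensive single thread finish any faster.

```cpp
#include <atomic>

// Hypothetical flag, set by the application (for example from a key handler)
// when the user wants to cancel a long-running launch.
static std::atomic<bool> g_abortRequested{false};

// Same shape as RTtimeoutcallback: no arguments, int result.
// Returning zero lets the launch continue; a nonzero value requests an abort.
int timeoutCallback()
{
    return g_abortRequested.load() ? 1 : 0;
}

// Registration (assumed polling interval of 0.1 seconds):
//   rtContextSetTimeoutCallback(context, timeoutCallback, 0.1);
```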

“What’s your GPU system setup?”
Currently not at work, will let you know.

“Is that a copy-paste error or an actual infinite loop in your code?”
Thx for pointing out the loop header, was just a copy paste error, fixed it.

“VPL means Vertical Plane Launch?”
VPLs means virtual point lights.

“The important question is how many rays does that spawn per launch?”
I want to place about 50k of these virtual point lights in the scene. This is done in a separate launch, which completes in a matter of milliseconds.
In the next step the actual rendering is done by shooting simple camera rays into the scene and calculating the light by iterating over all VPLs (per camera ray).
As occlusion testing is currently done via shadow rays, the number of rays per pixel sample is almost equal to the number of VPLs used.
I have tried to do this with just one sample per pixel and varying launch sizes (tried all powers of two from 512x512 down to 4x4) and always got a timeout.
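
A quick back-of-the-envelope calculation shows why shrinking the launch size did not help here. The sketch below is an upper bound that assumes every pixel sample traces one camera ray plus one shadow ray per VPL; the function names are made up for illustration.

```cpp
#include <cstdint>

// Rays traced by one whole launch: every pixel sample shoots one camera ray
// plus one shadow ray per VPL.
std::uint64_t raysPerLaunch(std::uint64_t width, std::uint64_t height,
                            std::uint64_t samplesPerPixel, std::uint64_t vplCount)
{
    return width * height * samplesPerPixel * (1 + vplCount);
}

// Rays traced by a single launch index (one thread).
// This does not depend on the launch dimensions at all.
std::uint64_t raysPerThread(std::uint64_t samplesPerPixel, std::uint64_t vplCount)
{
    return samplesPerPixel * (1 + vplCount);
}
```

With 50,000 VPLs and one sample per pixel, a 512x512 launch traces about 13.1 billion rays and a 4x4 launch about 800 thousand, but each individual thread traces 50,001 rays in both cases. Tiling reduces the first number and leaves the second untouched.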

“How did you set up the scene? Which acceleration structure builder did you use?”
I loaded a Wavefront OBJ with an ‘ObjLoader’ into my ‘MeshScene’,
using Sbvh as the builder and Bvh as the traversal method.

“Other ways to split work would be to calculate results iteratively.”
I am adding the results of multiple launches together in an iterative approach, and
rays are also launched iteratively (both similar to how it’s done in the pathtracer sample).
I guess that is what you meant. The only sensible way to reduce processing time per launch any further
seemed to be to reduce the launch size.
One more thing I should add here is that I can easily reduce the number of VPLs per launch and it renders fine no matter what launch size is used, but launches with a lot of VPLs are more interesting for my research at the moment.

"But as long as you cannot reduce the time it takes to calculate… "
So there might be no way to render this amount of VPLs with shadow rays for occlusion testing?

The thing that confuses me is that some launches work and some crash the display driver.
If a single launch works and I’ve split it up as shown above, shouldn’t it work for all of them?
They all shoot the exact same number of rays. Of course some terminate earlier than others, and lit areas therefore take longer to process than dark ones, but going from super fast processing to crashing seems like a stretch to me, don’t you think?

You can change the display driver timeout (TDR) settings in the Windows registry; see “Testing and debugging TDR during driver development” in the Microsoft Docs for Windows drivers.
However, if you trace 50K shadow rays, a possible solution is to split this tracing into multiple parts, for example 100 launches of 500 shadow rays each. The main problem is that you must then trace the camera rays 100 times. You can use these camera rays as anti-aliasing rays by applying some subpixel offset, or, if you do not need anti-aliasing and have enough memory, you can store the camera hit data in a buffer on the first pass and read it back for the remaining 99 passes.
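
The idea of storing the camera hit data on the first pass could be sketched on the host side roughly as follows. This is only a CPU mock-up to show the control flow; in an actual OptiX program the cache would presumably be a device buffer written by the ray generation program. CameraHit, CameraHitCache and the traceCameraRay callable are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel record of the camera ray's first hit.
struct CameraHit {
    float position[3];
    float normal[3];
    bool  valid; // false if the camera ray missed the scene
};

// Hypothetical cache: filled during the first pass, only read by later passes.
class CameraHitCache {
public:
    CameraHitCache(std::size_t width, std::size_t height)
        : m_width(width), m_height(height), m_hits(width * height), m_filled(false) {}

    // During the first pass the camera ray is traced and its hit is stored.
    // After markFilled() the stored hit is returned without tracing again.
    template <typename TraceFn>
    const CameraHit& get(std::size_t x, std::size_t y, TraceFn traceCameraRay)
    {
        assert(x < m_width && y < m_height);
        CameraHit& hit = m_hits[y * m_width + x];
        if (!m_filled) hit = traceCameraRay(x, y);
        return hit;
    }

    // Call once after the first full pass over the image.
    void markFilled() { m_filled = true; }

private:
    std::size_t            m_width;
    std::size_t            m_height;
    std::vector<CameraHit> m_hits;
    bool                   m_filled;
};
```

The trade-off is memory for time: one record per pixel is kept around so that the remaining passes only pay for shadow rays.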

You can’t expect 50000 rtTrace() calls per individual launch index to be fast enough. That is a recipe for timeouts.

At least not in one launch on your current system setup, but your case is easy to split into more but faster launches.

That experiment contains the answer to how you’ll need to split up your work.
Since lighting is additive you can also split up the work by only evaluating a subset of lights per launch and then accumulate the individual launch results to the final result.
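
Splitting by light subsets could look roughly like the sketch below. It is a host-side mock-up under stated assumptions, not the poster's code: LightBatch, makeLightBatches() and accumulateBatch() are hypothetical names, and accumulateBatch() merely stands in for one launch whose ray generation program would loop over the given range of VPLs and add into an accumulation buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one launch: the slice of the VPL list it evaluates.
struct LightBatch {
    std::size_t first; // index of the first VPL in this batch
    std::size_t count; // number of VPLs in this batch
};

// Split vplCount lights into consecutive batches of at most batchSize lights.
std::vector<LightBatch> makeLightBatches(std::size_t vplCount, std::size_t batchSize)
{
    assert(batchSize > 0);
    std::vector<LightBatch> batches;
    for (std::size_t first = 0; first < vplCount; first += batchSize)
        batches.push_back({first, std::min(batchSize, vplCount - first)});
    return batches;
}

// Stand-in for one launch: adds the contribution of this batch's VPLs to every
// pixel. Because lighting is additive, running all batches one after another
// gives the same sum as evaluating every VPL in a single launch.
void accumulateBatch(std::vector<float>& pixels,
                     const std::vector<float>& vplIntensity,
                     const LightBatch& batch)
{
    assert(batch.first + batch.count <= vplIntensity.size());
    for (float& p : pixels)
        for (std::size_t i = batch.first; i < batch.first + batch.count; ++i)
            p += vplIntensity[i];
}
```

With 50,000 VPLs and a batch size of 500 this produces 100 launches, and the batch size becomes the knob that controls how long each launch takes.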

Thanks a lot! I’m not sure why tracing a subset of VPLs seemed wrong to me at first.
Guess I was quite confused, now everything is clear and working.
Tracing a subset of VPLs is also a lot better than a tiled approach for what I’m doing and
now I can easily control how much time each launch takes.