How to optimize the curves' rendering?

zhoubosheng · July 12, 2021, 7:11am

I am using OptiX 7.3 with NVIDIA driver 471.11 on Windows 10.
I want to render hair with over 100 thousand strands (about 50 segments per strand). Rendering takes a long time (about 15 s per frame at my chosen sample and bounce counts), and I found that nearly all of that time is spent in the optixLaunch function. I don’t know how to optimize the rendering, so what should I do?

dhart · July 12, 2021, 4:32pm

Hi @zhoubosheng,

Can you tell me more about what your workload looks like? I don’t have enough information to know if 15 seconds per frame is fast or slow. Do you have an estimate of how many rays per second you are getting, meaning the number of optixTrace() calls?

Other information that will help includes which GPU you’re using, what kind of curves you’re rendering, what resolution, how many threads you are launching, how many bounces you allow, what rendering algorithm you’re using, what kind of shading you’re doing, etc.

If you are using OptiX built-in curves (and not your own custom curve intersector), then there are two very easy things you can do to make it go much faster: you can use the linear or quadratic curve types, and you can use the BVH build flag OPTIX_BUILD_FLAG_PREFER_FAST_TRACE.

Linear curves are the fastest to render, but because they are linear, you might be able to see undesirable artifacts like straight segments and angled joints. Quadratic curves are faster than cubic, but not quite as fast as linear. However, quadratic will look very similar to cubic and avoid the visible segmenting that linear has.

The build flag OPTIX_BUILD_FLAG_PREFER_FAST_TRACE will split OptiX curve AABBs into more smaller bounds, which renders faster at the cost of more memory. Currently the ratio is approximately 2x, meaning in normal cases we expect it to render up to 2x faster, at the cost of up to 2x more memory usage for the curves.

There are probably other ways to optimize your rendering that we can discuss once I know a little more about your workload and goals. If shading is the bottleneck, then you may need to use Nsight Compute to profile your shaders to find out what’s going on.

–
David.

zhoubosheng · July 13, 2021, 4:09am

Thanks for your reply, David.

I’m sorry that I left out some important information in my first post. I use an RTX 2080 Ti (TU102). The render resolution is 1600x1200, with 256 samples per pixel and a maximum of 10 bounces. For the hair shading model, I implemented the algorithm from the paper “A Practical and Controllable Hair and Fur Model for Production Path Tracing”.

Here is some host pseudocode from the render function:

...
// every frame
for ( size_t i = 0; i < spp; ++i) {
    // update some global params for samples in device...
    optixTrace();
    // accumulate the result into the film
    addtiveResult();
}
// filter of every frame
normalizePixel();
...

Here is some device pseudocode from the .cu files:

void __raygen__output_pinhole()
{
    // get some info
    // init sampler
    while (bounce < max_bounces)
    {
        // intersect ray with scene
        optixTrace(...);
        
        // if the ray misses, break out of the loop

        // intersection operations....
        computeDirectLighting(...);
        
        // sampling for next ray
        samplingNextRay(...);

        // possibly terminate the path with Russian roulette.
        russianRoulette(); 
    }

    // record some result
}

I already use the OPTIX_BUILD_FLAG_PREFER_FAST_TRACE flag in my code, and it is very effective, as you mentioned. I want the hair curves to look as good as Arnold’s in Maya, so I use the cubic type.

The image is an example from my renderer (the pink hair takes more time to render than the brown hair):

zhoubosheng · July 13, 2021, 7:35am

I have also profiled with Nsight Compute, but it is difficult for me to interpret the results. Does NVIDIA have an official channel that helps customers with profiling?

dhart · July 14, 2021, 6:16pm

So we can calculate an upper bound for your current rays / second: 1600 * 1200 * 256 * 10 * 2 is approximately 9.8 billion rays, in 15 seconds, or about 655 million rays per second. That speed is actually faster than I expect for cubic curves based on your picture and GPU, but I’m certain my ray estimates here are much higher than what you actually have. It will help you understand performance if you can calculate the total number of rays you cast for an entire frame.

I put the factor of 2 in there because I’m assuming you are using next event estimation as part of computeDirectLighting(), meaning you shoot a shadow ray for every hit point in your path. Is there only 1 light source, and only 1 shadow ray per bounce? With Russian Roulette, your actual total rays will not be 10 billion, but much lower. It depends entirely on your roulette probability, but maybe your total is a factor of 2 or 3 lower?
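
For reference, the upper-bound arithmetic above can be written as a small helper. This is only a sketch: the function names are made up for illustration, and passing raysPerBounce = 2 encodes the assumption of one path ray plus one shadow ray per hit point.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on the rays cast for one frame, assuming every path runs to the
// maximum depth and casts `raysPerBounce` rays at each bounce.
// (Hypothetical helper, not part of OptiX.)
std::uint64_t maxRaysPerFrame(std::uint64_t width, std::uint64_t height,
                              std::uint64_t spp, std::uint64_t maxBounces,
                              std::uint64_t raysPerBounce)
{
    return width * height * spp * maxBounces * raysPerBounce;
}

// Average throughput in rays per second for a measured frame time.
double raysPerSecond(std::uint64_t rays, double seconds)
{
    assert(seconds > 0.0);
    return static_cast<double>(rays) / seconds;
}
```

With the numbers from this thread, maxRaysPerFrame(1600, 1200, 256, 10, 2) is 9,830,400,000 rays, and dividing by 15 seconds gives roughly 655 million rays per second. Replacing the upper bound with a measured ray count gives the real rate.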

The intersection rate of curves is highly dependent on the curve data and camera angles and several other factors. I have not done a lot of testing using a 2080 Ti, but I might expect a 2080 Ti to be able to calculate cubic curve intersections at a rate of around 500 million rays / second when using OPTIX_BUILD_FLAG_PREFER_FAST_TRACE and given a data set like in the picture you have. Our internal benchmarks range from 250 million to 1 billion rays per second on RTX 8000, which I expect is 10-15% faster than 2080 Ti, but please note I’m guessing a lot here.

In your code sample, I guess the top snippet is host (CPU) code, right? So the optixTrace() call there is referring to a launch, and not to tracing a single ray, right?

The two most likely places you can optimize your algorithm are the Russian Roulette part and the shading. With Russian Roulette on the GPU, the most important thing to know is that it will not save time until all active threads in a warp can exit. This is up to 32 threads, but with ray tracing you commonly have fewer than 32 active threads in a hit shader, because some rays will miss the geometry during traversal.

You already know what the probability is that a single thread will exit early due to Russian Roulette. You can use that number to estimate the probability that an entire warp will exit early. If pt is the probability of terminating a thread, then the probability that a warp will exit is pw ~= pt ^ (number of active threads). So if your Russian Roulette probability is 50%, for example, then the probability that a warp with 16 active threads will terminate on a given bounce is 1 / 2^16 = 1 / 65,536, or about one chance in 65 thousand (almost never). Another way to think about it is that with 16 active threads and a 50% Roulette probability, you can expect to take about 5 bounces on average before all threads exit. Remember that as long as any thread in a warp is active, then the whole warp is active, and having a low number of active threads in a warp is one form of divergent behavior: very bad for parallelism and performance.
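
The two estimates in this paragraph can be checked numerically. The sketch below assumes a simplified model in which every thread terminates independently with the same fixed probability after each bounce (real Roulette probabilities vary with throughput), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that all `activeThreads` threads terminate on the same bounce,
// given a per-thread termination probability `pTerminate`.
double warpExitProbability(double pTerminate, int activeThreads)
{
    assert(pTerminate >= 0.0 && pTerminate <= 1.0 && activeThreads > 0);
    return std::pow(pTerminate, activeThreads);
}

// Expected number of bounces until every thread has terminated.
// Uses E[max] = sum over k >= 0 of P(at least one thread survives k bounces),
// where P(one thread survives k bounces) = (1 - pTerminate)^k.
double expectedBouncesUntilWarpExits(double pTerminate, int activeThreads)
{
    assert(pTerminate > 0.0 && pTerminate <= 1.0 && activeThreads > 0);
    const double survive = 1.0 - pTerminate;
    double surviveK = 1.0;  // (1 - pTerminate)^k, starting at k = 0
    double expected = 0.0;
    for (int k = 0; k < 100000; ++k) {
        // Probability that all threads have terminated within k bounces.
        const double allDone = std::pow(1.0 - surviveK, activeThreads);
        const double someAlive = 1.0 - allDone;
        expected += someAlive;
        if (someAlive < 1e-12) break;  // remaining terms are negligible
        surviveK *= survive;
    }
    return expected;
}
```

Under this model, warpExitProbability(0.5, 16) is exactly 1/65,536, and expectedBouncesUntilWarpExits(0.5, 16) comes out a little over 5, while a single thread on its own would need only 2 bounces on average.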

An easy way to test things is to simply disable Russian Roulette and see how different the timings are. If they stay the same, then Russian Roulette is not helping, and a lot of potential compute time is being wasted. There are some strategies people use to try to minimize divergence when using Russian Roulette on the GPU, and they may be worth some investigation.

For shading, that’s where Nsight Compute profiles are needed in order to identify the bottlenecks. The most common bottlenecks are memory usage, so understanding where the loads and stores are will help you think about your data flow.

One easy thing you can do is disable your shading completely, and test how long the render takes. This will give you a good idea of what portion of your frame is traversal+intersection only, and what portion is shading only.

Another easy thing to do is render primary rays with direct lighting only, and disable bounces and disable samplingNextRay(). This way you should be able to easily count the total number of calls to optixTrace() so you know exactly how many rays you are casting, and get a very good idea of the performance for primary rays. This combined with disabling the shading will help you isolate the OptiX performance and allow you to see whether you should try to optimize your curve data & OptiX setup, or whether you should focus on the shading and rendering algorithm.

It is sometimes effective to put your spp loop inside raygen and use a single launch. This eliminates the launch overheads. In your case, I don’t expect this to make any meaningful improvement; it could actually make things much worse due to Russian Roulette. But it’s so easy to do that it might be worth trying, and if it does make things much worse, that is a useful piece of information.

–
David.

zhoubosheng · July 15, 2021, 3:29am

Thanks a lot, @dhart . Your suggestion is greatly appreciated. I will try to test and give you a reply later.

zhoubosheng · July 16, 2021, 6:28am

Hi, @dhart . I’m back. Here are some profiling results.
For certain reasons, I used new parameters (spp: 512, max_bounces: 10, resolution: 1600x1200) for testing, but still on the same computer.

Here are some statistics from rendering the image above:

Total ray count: 3,773,141,032
Ray count per pixel per sample: 3.83824
Time cost: 28769.1 ms

So the ray tracing time is much higher than the benchmarks suggest, right?

Russian roulette

I use the pbrt-v3 implementation for the Russian Roulette part, which has a dynamic threshold. If I turn it off, the time cost is 10%-15% higher for the pink hair. But when I change the material to render brown hair, the time cost is 2x higher. So I think Russian Roulette is efficient for darker hair and less efficient for lighter hair, which needs more samples and bounces and whose rays terminate later, right? Anyway, changing the threshold strategy would be beneficial, and the rendering performance is acceptable.

Only primary rays

In this case, I disable the bounces, the shading, and samplingNextRay(). The total ray count is 1600x1200x512, approximately 983 million, and rendering takes about 2.08 s. So is the performance for primary rays reasonable? Is it worth trying to optimize my curve data, or not?
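
For readers unfamiliar with the pbrt-v3 rule mentioned above, here is a rough paraphrase as a standalone function. It is a sketch, not the renderer's actual code: the struct and function names are hypothetical, the refraction scaling that pbrt applies to the throughput is left out, and the constants (threshold 1, minimum probability 0.05, no termination during the first few bounces) are taken from pbrt-v3's path integrator as I understand it.

```cpp
#include <algorithm>
#include <cassert>

struct RouletteDecision {
    bool  terminate;        // true if the path should stop here
    float throughputScale;  // factor to multiply into the path throughput
};

// maxThroughput: largest color component of the current path throughput.
// bounce:        current path depth (0 for the camera ray's hit).
// u:             uniform random number in [0, 1).
RouletteDecision russianRoulette(float maxThroughput, int bounce, float u,
                                 float rrThreshold = 1.0f)
{
    assert(u >= 0.0f && u < 1.0f);
    if (maxThroughput < rrThreshold && bounce > 3) {
        // The termination probability rises as the throughput falls,
        // so paths through dark hair are killed sooner.
        const float q = std::max(0.05f, 1.0f - maxThroughput);
        if (u < q)
            return {true, 0.0f};
        // Survivors are reweighted so the estimate stays unbiased.
        return {false, 1.0f / (1.0f - q)};
    }
    return {false, 1.0f};
}
```

This is consistent with the timings reported here: bright hair keeps the throughput high, so the termination probability stays near its minimum and most paths run close to the maximum depth.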

dhart · July 16, 2021, 4:11pm

Yes, both of those make sense to me. With the modern hair BRDF, I believe lighter hair depends more on transmission rays which will increase the render time.

And yes, with the shading disabled you’re getting very close to 500 million curve intersections per second. That’s more or less exactly what I would expect, and I don’t think there is any performance problem that needs to be optimized necessarily. However, if you have plenty of memory to spare, you still have the option to resample or split your hair curves at a higher rate, so that you will have more segments along each strand. This is exactly what the OptiX flag OPTIX_BUILD_FLAG_PREFER_FAST_TRACE is doing currently, but you have the option to control the sampling yourself, if you want to, and you can gain some more speed at the cost of using more memory.

You also have the option to try quadratic and/or linear curves, as I mentioned before. I realize those may come with some additional concerns, but keep in mind that the linear intersector is frequently around 2x the speed of the cubic intersector. If you were to resample your curves for performance anyway, then linear becomes easier to use because you’re less likely to see linear joints. Many pro renderers already do linear sampling of cubic curves when rendering, with 8 or 16 samples per segment. In our tests, splitting cubic curves once or twice helps a lot with performance, but we see the benefit of higher sample rates will fall off pretty quickly due to increasing overlap of the bounds of neighboring segments. While this is completely dependent on your input data, I feel like it’s fairly common for performance to flatten out with more than 8 samples per cubic curve segment.
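
As an illustration of linear sampling of a cubic segment, the host-side sketch below evaluates one segment at evenly spaced parameter values. It assumes the strands are stored as uniform cubic B-splines; the Float3 type and function names are hypothetical, and per-vertex radii are omitted (they could be interpolated with the same weights).

```cpp
#include <cassert>
#include <vector>

struct Float3 { float x, y, z; };

// Evaluate a uniform cubic B-spline segment with control points p[0..3]
// at parameter t in [0, 1].
Float3 evalCubicBSpline(const Float3 p[4], float t)
{
    const float t2 = t * t;
    const float t3 = t2 * t;
    const float s  = 1.0f - t;
    // Standard uniform cubic B-spline basis weights (they sum to 1).
    const float b0 = s * s * s / 6.0f;
    const float b1 = (3.0f * t3 - 6.0f * t2 + 4.0f) / 6.0f;
    const float b2 = (-3.0f * t3 + 3.0f * t2 + 3.0f * t + 1.0f) / 6.0f;
    const float b3 = t3 / 6.0f;
    return { b0 * p[0].x + b1 * p[1].x + b2 * p[2].x + b3 * p[3].x,
             b0 * p[0].y + b1 * p[1].y + b2 * p[2].y + b3 * p[3].y,
             b0 * p[0].z + b1 * p[1].z + b2 * p[2].z + b3 * p[3].z };
}

// Replace one cubic segment with `pieces` linear pieces by sampling
// pieces + 1 points along it, including both end points.
std::vector<Float3> tessellateSegment(const Float3 p[4], int pieces)
{
    assert(pieces >= 1);
    std::vector<Float3> points;
    points.reserve(static_cast<std::size_t>(pieces) + 1);
    for (int i = 0; i <= pieces; ++i) {
        const float t = static_cast<float>(i) / static_cast<float>(pieces);
        points.push_back(evalCubicBSpline(p, t));
    }
    return points;
}
```

When adjacent segments of a strand are tessellated this way, the last point of one segment coincides with the first point of the next, so one of the two can be dropped when building the linear strand. Calling tessellateSegment(p, 8) would correspond to the 8 samples per segment mentioned above.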

Also we can see that with 512 spp and path tracing, you’re getting a rendering rate of around 3.77 billion rays / 28.8 s ~= 131 million rays per second for your overall path tracing. Your hair data & camera are covering less than half of the view, so we have to account for most of your primary rays being fast misses. But this is not bad for an overall rendering rate when using many-bounce path tracing along with sophisticated shading. This new number allows us to see that your indirect rays and shading together are perhaps around 4 times slower than primary rays (maybe a bit less because of the camera view), which means that when looking for places to optimize, there is likely to be more benefit by looking at shading and perhaps the ray direction sampling too, than by adjusting the curve data. I don’t know the state of the art of hair BRDFs on the GPU, but I think there might be some algorithms in the GPU research community for allowing some correlation in the direction sampler to try to keep more coherence between neighboring threads.

So overall, I think it would be reasonable to come to the conclusion that your rendering speed is respectable enough and focus on other features. Ampere GPUs in our curves tests are around 2x faster than comparable Turing models (more than 2x if you use motion blur), and we will continue to improve the GPUs, so things will still get faster with the lazy approach. ;) And at the same time, there is some room for improvement if you really want to get the fastest speeds possible, and you have the time to spend on it. To get faster, you can use curve resampling and/or linear curves, you can research GPU-specific algorithms for Russian Roulette and for BRDF direction sampling, and you can start doing some Nsight Compute profiling and optimizing of your material shader.

BTW, I can help you read your Nsight Compute profile, if you’re willing to share it and discuss it on the forum. If you’re not able or just not comfortable with that, I’m happy to answer questions about it or give general pointers. The Nsight Compute guide is also worth reading through: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html

–
David.

zhoubosheng · July 23, 2021, 8:20am

Hi David,
Sorry for getting back to you so late.
I also tested linear curves last week. And yes, the linear intersector is about 2x the speed of the cubic intersector. But the render with linear curves has some artifacts and is not smooth enough.
I think resampling the curves will be more beneficial, and I will try it later.
I am discussing sharing the Nsight Compute profile with my colleague. I appreciate your offer to help us read it. Thanks again for your replies. I will keep you posted, @dhart.