Variable sampling rate

Hi!

I was trying to do a short experiment with a variable sampling rate based on the optixPathTracer example from the OptiX SDK 7.4. I am using Ubuntu 22.04 LTS with driver version 520.61.05 and CUDA version 11.8.

The idea is to define a circular region: pixels inside that region use one predefined sampling rate, and pixels outside use another. The region is dynamic, with its center at the cursor position.

For that I extracted the mouse cursor position with the GLFW library and passed the data to the __raygen__rg() shader. The modified code snippet looks something like this:

    const uint3 launch_index = optixGetLaunchIndex(); 
    double center_x = params.cx; // received from glfw
    double center_y = params.cy;

    const int radius = 600; // defined radius size
    float3  accum_color = ((((launch_index.x - center_x) * (launch_index.x - center_x)) +
                            ((launch_index.y - center_y) * (launch_index.y - center_y))) <= (radius * radius)) ? result / static_cast<float>( params.samples_per_launch ) : result / static_cast<float>(params.periphery_per_launch );

    // defined params.samples_per_launch = 4
    // params.periphery_per_launch = 16
    // rest of the code is almost the same as in optixPathTracer
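For completeness, a sketch of the host-side plumbing (the struct and helper names are my own; the actual values come from glfwGetCursorPos( window, &xpos, &ypos ), which reports doubles with the origin at the top-left corner of the window):

```cpp
// Sketch (my own helper, not SDK code): pack the cursor position
// reported by GLFW into the values stored in the launch parameters.
struct LaunchCenter
{
    float cx;
    float cy;
};

LaunchCenter toLaunchCenter( double xpos, double ypos )
{
    // Float precision is sufficient for pixel coordinates.
    return { static_cast<float>( xpos ), static_cast<float>( ypos ) };
}
```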

Now it is working. But the problem is that while moving the cursor and shifting the circular region, the previous circular region remains visible in its old position, something like the figure below (top left corner; the brightest white circle is the light source).

[image: new_overlapping]

I think it is because I somehow messed up the output buffer, or the buffers are overlapping. I do not want to see these overlapping circles; as the mouse moves, there should be only one circle, with the variable sampling rate inside and outside. A close shot looks like this:

Could you please give some guidelines on what I am doing wrong?

Sounds like you’re not correctly resetting the subframe index to zero, which triggers the initial write of the output buffer inside the ray generation program.

If you look closely at this code inside the __raygen__rg() function, you’ll see that it writes the color to both buffers without accumulation when the subframe_index == 0.

    if( subframe_index > 0 )
    {
        const float                 a = 1.0f / static_cast<float>( subframe_index+1 );
        const float3 accum_color_prev = make_float3( params.accum_buffer[ image_index ]);
        accum_color = lerp( accum_color_prev, accum_color, a );
    }
    params.accum_buffer[ image_index ] = make_float4( accum_color, 1.0f);
    params.frame_buffer[ image_index ] = make_color ( accum_color );

Then if you debug through the optixPathTracer host code, you’ll see that it calls updateState() before launching and that resets the params.subframe_index to zero when the camera parameters or the window client size changed.

    void updateState( sutil::CUDAOutputBuffer<uchar4>& output_buffer, Params& params )
    {
        // Update params on device
        if( camera_changed || resize_dirty )
            params.subframe_index = 0;

        handleCameraUpdate( params );
        handleResize( output_buffer, params );
    }

You would need to enhance that if-statement there to include each change of your center point as another reset condition.
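A minimal self-contained sketch of that (the struct and function here are placeholders for illustration; in the sample you would extend updateState() itself and track the previous center somewhere on the host):

```cpp
// Sketch: reset the progressive accumulation whenever the foveation
// center moved, in addition to the existing camera/resize conditions.
// All names except subframe_index are my own placeholders.
struct FoveaParams
{
    unsigned int subframe_index;
    float        cx;
    float        cy;
};

static float prev_cx = 0.0f; // center used for the previous launch
static float prev_cy = 0.0f;

void updateCenter( FoveaParams& params, float cx, float cy,
                   bool camera_changed, bool resize_dirty )
{
    const bool center_changed = ( cx != prev_cx ) || ( cy != prev_cy );
    prev_cx = cx;
    prev_cy = cy;

    params.cx = cx;
    params.cy = cy;

    // Anything that changes the final image must restart the accumulation.
    if( camera_changed || resize_dirty || center_changed )
        params.subframe_index = 0;
}
```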

Some comments to your code excerpts:

  • Do not use double types for the center_x, center_y variables and the params.cx, .cy values. Use float.
  • Do not use individual float fields for 2D vectors; use a float2, which loads faster.
  • You can precalculate your squared radius if that is all you need.
  • Use vector operations where possible. (CUDA does not implement all vector operations. The usual ones are implemented inside the vec_math.h helper header, or my vector_math.h version of it, which has some more.)

So something like this:

    const float radius_squared = 600.0f * 600.0f;
    const uint2 launch_index = make_uint2(optixGetLaunchIndex());
    const float2 delta = make_float2(launch_index) - params.center; // center is a float2 with your cx, cy values.
    const unsigned int num_samples = (dot(delta, delta) <= radius_squared) ? params.samples_per_launch : params.periphery_per_launch;
    float3  accum_color = result / static_cast<float>(num_samples);

To get correct color results would of course require that the respective launch indices have been using the same number of samples in the do-while loop over the samples count, so that “launch index is inside the radius” calculation and selection of the samples per launch index would need to happen before the do-while loop.


Thanks @droettger for the detailed instructions. Resetting the subframe_index to zero solved the overlapping problem.

If I may ask two more questions related to this:

  1. As I am varying the number of samples with respect to the cursor position, can I call this screen-space dynamic foveated rendering? I am currently studying adaptive sampling and guess I may be right. What do you think?

  2. About your comment

To get correct color results would of course require that the respective launch indices have been using the same number of samples in the do-while loop over the samples count, so that “launch index is inside the radius” calculation and selection of the samples per launch index would need to happen before the do-while loop.

I think the do..while() loop is the most important part of the __raygen__rg() shader, as it shoots the rays per pixel and also controls the bounces. Are you suggesting that instead of one loop I should use two loops? Because, for example:

    // optixPathTracer.cpp
    params.samples_per_launch = 8;
    params.periphery_per_launch = 2;

    ...
    // from optixPathTracer.cu
    do{
    ...
    }while(--params.samples_per_launch)
    // this is defined for only one of the two sampling rates

Let’s start from the bottom.

The code line while(--params.samples_per_launch) won’t even compile because params is __constant__.

The ray generation program is called once per launch index, the values you get with optixGetLaunchIndex().
The optixLaunch(..., width, height, depth) arguments set the launch dimensions, the values you get with optixGetLaunchDimensions() on the device side.

If the optixLaunch is 2D and width and height match the output buffer’s dimension, you’re running the ray generation program once per pixel inside the output buffer.
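The linear image_index used for the buffer writes shown earlier then follows from the launch index with the usual row-major mapping:

```cpp
// Row-major mapping from the 2D launch index to the linear index into
// params.accum_buffer and params.frame_buffer (one entry per pixel).
unsigned int imageIndex( unsigned int x, unsigned int y, unsigned int width )
{
    return y * width + x;
}
```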

The do-while loop in there controls how many ray paths (the for(;;) loop) this unidirectional path tracer starts into the scene. That’s the number of samples per pixel (spp).

That spp count is controlled by the statement int i = params.samples_per_launch; directly before the do { } while(--i); loop.

If you want to shoot a different number of spp you must calculate that i value differently and finally divide the accumulated color values by the same spp count before writing it to the output buffers.

There is no need for two loops for different spp counts at all. You’re not thinking “parallel computing” enough.

    ...
    const float radius_squared = 600.0f * 600.0f;
    const uint2 launch_index = make_uint2(optixGetLaunchIndex());
    const float2 delta = make_float2(launch_index) - params.center; // center is a float2 with your cx, cy values.
    const unsigned int num_samples = (dot(delta, delta) <= radius_squared) ? params.samples_per_launch : params.periphery_per_launch;

    int i = num_samples; // Different spp value inside and outside the circle.
    do {
    ...
    } while (--i);
    ...
    float3  accum_color = result / static_cast<float>(num_samples); // Must use the same spp.
    ...

To 1.)
The algorithm above is a standard thing many path tracers use to focus more samples on a specific region of an image to get faster feedback during interactive look development. You usually pick the area with objects using a material you’re tuning and let the selected area refine faster while changing the material parameters.
Nothing special at all, and also not what I would call foveated rendering.

You’re still rendering all pixels inside the output buffer. Foveated rendering algorithms supported by Variable Rate Shading try to reduce that number because the resolution of HMDs and their refresh rates are ever increasing.
Though VRS can also be used to increase the shading quality to reduce aliasing on areas or materials which need it. (Follow both links.)

You’re using a very simple unidirectional path tracer as the foundation, and you’re reducing the typical high-frequency noise from Monte Carlo algorithms inside the circular region by shooting more samples. That’s similar to Variable Rate Supersampling (also described on the VRS site linked above).
That also means the periphery will be noisier, in a region where human perception is even more sensitive to movement.

A lot of foveated rendering research went into reducing the sampling rate to below the number of pixels on the screen while reducing flickering inside the periphery regions. Just have a look through the research publications from NVIDIA.


Hi!

Thank you once again for all the knowledge. The output now looks like this:

If I may add a few comments related to foveated rendering: now I understand this is really super-sampling, where I define an area and add more samples inside (e.g., 8 spp) while outside the circle it is 1 spp. But if I change this sampling rate with respect to the original Contrast Sensitivity Function (CSF), would that then be foveated path tracing?

You’re still rendering all pixels inside the output buffer.

Yes, I am rendering all the pixels, but at a lower sampling rate, and let’s say the sampling rate mimics the CSF. For example, if I have a 4K rendering window, don’t I need to render all 4K pixels into the output buffer? What I can think of in another way:

  • render a lower number of pixels, e.g., 600x600, and upscale it to match the 4K resolution
  • and/or render pixel blocks (e.g., 2x2) per sample in the peripheral region, as in VRS

If I do that, will it be called foveated rendering? Sorry, I have drifted quite far off-topic from OptiX now.

It’s Detlef.

You need to fill all pixels on the HMD somehow. If the fovea and periphery areas are sampled at different rates, you can call that foveated rendering if you like. I would mainly use that term if the number of pixels rendered is less than the number of pixels displayed.

If you render the fovea area at a higher resolution than the periphery, meaning you render your 600x600 pixel area at the gaze position in the native 4K resolution and do not upscale it, that would be foveated rendering.
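To put rough numbers on that, a back-of-the-envelope sketch (my own helper, assuming a square native-resolution inset plus a periphery rendered at 1/scale resolution per dimension and upscaled afterwards):

```cpp
// Pixels shaded per frame for a simple foveated split: a
// native-resolution inset of inset x inset pixels plus a periphery
// rendered at reduced resolution and upscaled to the display size.
unsigned long long foveatedRayCount( unsigned long long width,
                                     unsigned long long height,
                                     unsigned long long inset,
                                     unsigned long long scale )
{
    const unsigned long long fovea     = inset * inset;
    const unsigned long long periphery = ( width / scale ) * ( height / scale );
    return fovea + periphery;
}
```

For a 3840x2160 target with a 600x600 inset and a 2x2 periphery reduction that yields 360,000 + 2,073,600 = 2,433,600 shaded pixels instead of 8,294,400, i.e. less than a third.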

There are even HMDs (from Varjo) which use two different displays per eye to render the center of the screen at a higher DPI to achieve that effect physically.

On the CSF: from glossing over some images and articles, that is another means to determine the visual acuity of humans over frequency and contrast.
I’m not sure how to apply that information to different sampling rates for the fovea and periphery. Wouldn’t that require another dimension of the CSF describing how that graph changes with the distance to the fovea center?

Also, changing the sampling rate according to a person’s individual acuity for pattern frequency over contrast would require knowing the contrast of the currently rendered image at those pattern frequencies in the CSF. But you don’t really have that information while rendering the image itself, especially when sampling below the CSF pattern frequencies (the Nyquist sampling theorem comes to mind), so this would need more information, maybe from previous images. And each time the gaze changes, that data changes.
Good luck with that.


Hi Detlef, :)

Your suggestions are always the most valuable to me. I will carefully take these into consideration. At the moment I am implementing the idea of foveation on a multi-display setup, e.g., 3x3 4K displays. Using the threshold sampling rate, I can already see a performance improvement of more than 50%. I think I now need to carefully investigate how to make it more accurate foveated rendering, mathematically.