GPU program optimization questions

I’m learning how to use OptiX on Fedora 35 Linux with CUDA 11.5, OptiX 7.4, GPU driver 495.29.05, and an RTX 3060.

I have a program that displays some simple cubes and I am looking at a profiling run using nv-nsight-cu.

When I look at the details page of the report, it says I have low compute throughput (50%), memory throughput (30.94%), and DRAM throughput (25.07%), and it suggests I look at the scheduler and warp state statistics as well as the source counters.

The source counters section mentions several uncoalesced global accesses, where one shows a ratio of 2.0 and the others show a ratio of 1.0; I think 1.0 is OK.

The case where the ratio is 2.0 is a simple assignment to a uchar4 array element that is aligned to 4 bytes:
params.image[idx.y * params.image_width + idx.x] = charResult;

Reading the documentation, it seems that my code is setting the alignment correctly to 4 bytes, so I’m wondering what I’m missing.

The occupancy section of the report says theoretical occupancy is limited to 33% due to required registers and due to shared memory.

Do I have any control of shared memory with Optix? I don’t believe I am declaring any myself.

How do I limit the number of registers I use? Do I keep the scope between where I set a variable and where I use it as small as possible to minimize the lifetime of the variable?

I tried compiling to PTX with the nvcc -O3 optimization level and then setting OPTIX_COMPILE_OPTIMIZATION_LEVEL_3 and OPTIX_COMPILE_DEBUG_LEVEL_MODERATE, and that didn’t seem to change anything.

My raygen code, which includes the uchar4 array element store, follows:

static __forceinline__ __device__ void computeRay(uint3 idx, uint3 dim, float3 &origin, float3 &direction) {
    const float3 U = params.camU;
    const float3 V = params.camV;
    const float3 W = params.camW;
    const float2 d = 2.0f * make_float2(static_cast<float>(idx.x) / static_cast<float>(dim.x),
                                        static_cast<float>(idx.y) / static_cast<float>(dim.y)) - 1.0f;

    origin = params.camEye;
    direction = normalize(d.x * U + d.y * V + W);
}
extern "C" __global__ void __raygen__rg() {

    // Map our launch idx to a screen location and create a ray from the camera
    // location through the screen
    float3 rayOrigin;
    float3 rayDirection;
    // Lookup our location within the launch grid
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    computeRay(idx, dim, rayOrigin, rayDirection);
    // Trace the ray against our scene hierarchy
    unsigned int depth = 0;
    unsigned int p0;
    unsigned int p1;
    unsigned int p2;
    optixTrace(params.handle, rayOrigin, rayDirection,
               0.0f,                     // Min intersection distance
               100.0f,                   // Max intersection distance
               0.0f,                     // rayTime -- used for motion blur
               OptixVisibilityMask(255), // Specify always visible
               OPTIX_RAY_FLAG_NONE,      // No ray flags
               NORMAL_RAY,               // SBT offset   -- See SBT discussion
               NUM_RAY_TYPES,            // SBT stride   -- See SBT discussion
               NORMAL_RAY,               // missSBTIndex -- See SBT discussion
               depth, p0, p1, p2);
    float4 result;
    result.x = int_as_float(p0);
    result.y = int_as_float(p1);
    result.z = int_as_float(p2);
    result.w = 1.0f;

    // Record results in our output raster
    uchar4 charResult = make_color(result);
    params.image[idx.y * params.image_width + idx.x] = charResult;
}

Thanks, Dave

I don’t see anything you could optimize in the given code.

I would adjust two things:

  • You don’t need to load the camUVW vectors into local variables when they are used only once, but the compiler will optimize that away anyway.
  • The payload registers are unsigned int. (Same for the intersection attribute registers.)
    The reinterpret to float should use __uint_as_float(): result.x = __uint_as_float(p0);
    Note that CUDA 11.5 deprecates the casts without the leading underscores.
    The OptiX SDK examples unfortunately use integers as well, but the OptiX Programming Guide explains it correctly. (See the short sketch right after this list.)
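
For reference, a short sketch of the corrected block from the raygen program above (__uint_as_float() is a CUDA device intrinsic):

    float4 result;
    result.x = __uint_as_float(p0); // payload registers carry raw unsigned int bits
    result.y = __uint_as_float(p1);
    result.z = __uint_as_float(p2);
    result.w = 1.0f;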

I’m wondering what I’m missing.

The Nsight Compute results will also contain everything else happening internally. The overview section isn’t too helpful when optimizing your part of the device code. It’s more interesting when comparing multiple runs against a baseline to see what changed.
Instead go to the source code view, where you should be able to see the CUDA, PTX, and SASS code when you compiled your modules with line information. That needs to be enabled in the nvcc command line options, the OptixModuleCompileOptions, and the OptixPipelineLinkOptions!
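
For example, a minimal sketch of those three places, with field names taken from the OptiX 7.4 headers (adjust for other SDK versions; file names are just placeholders):

    // 1) nvcc: translate the .cu file to PTX with line information and full optimization, e.g.:
    //      nvcc -ptx -O3 --use_fast_math -lineinfo raygen.cu -o raygen.ptx

    // 2) Module compilation:
    OptixModuleCompileOptions moduleOptions = {};
    moduleOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
    moduleOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_MINIMAL; // keep only line info for profiling

    // 3) Pipeline linking (OptiX 7.4 still has a debugLevel here):
    OptixPipelineLinkOptions linkOptions = {};
    linkOptions.maxTraceDepth = 2;                                // whatever your pipeline needs
    linkOptions.debugLevel    = OPTIX_COMPILE_DEBUG_LEVEL_MINIMAL;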

If you profile the OptiX kernel (the __raygen__ entry points) with Nsight Compute, you can then look at the individual traced events there. Look at the Instructions Executed counters and use the “highest” and “next lower” buttons to find where your kernels spend the most time. Then try optimizing that code.

Do I have any control of shared memory with Optix? I don’t believe I am declaring any myself.

No, you cannot use shared memory inside OptiX device programs. The programming guide mentions this in chapter 6.2, Programming Model.

How do I limit the number of registers I use?

You can try setting the OptixModuleCompileOptions.maxRegisterCount, but that would be premature optimization again.
It’s recommended to leave it at OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT (== 0) to let OptiX decide how to compile the code.
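
If you do experiment with it anyway, it is a single field on the module compile options; a sketch against the OptiX 7.x headers:

    OptixModuleCompileOptions moduleOptions = {};
    // 0 means no explicit limit; OptiX and ptxas choose the register count.
    moduleOptions.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT;
    // Only set an explicit limit (e.g. 64) if profiling shows occupancy really is the bottleneck;
    // a lower limit usually just trades registers for local memory spills.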

Do I keep the scope between where I set a variable and where I use it as small as possible to minimize the lifetime of the variable?

OptiX needs to track variables which are live across calls to optixTrace or to direct or continuation callables. The fewer live variables there are, the more can be kept in registers and the fewer need to be saved to the stack.
Please search the OptiX Programming Guide for the word “register”, which will turn up some chapters that talk about that. Searching for “performance” will turn up more general tips on how to improve it in other areas.
There is a Performance Guidelines chapter inside the older OptiX 6.5.0 Programming Guide which contains some recommendations which are still applicable to OptiX 7 programs.

I tried compiling to PTX with the nvcc -O3 optimization level and then setting OPTIX_COMPILE_OPTIMIZATION_LEVEL_3 and OPTIX_COMPILE_DEBUG_LEVEL_MODERATE and that didn’t seem to change anything.

The new OPTIX_COMPILE_DEBUG_LEVEL_MODERATE is documented to have an impact on performance.
You should use OPTIX_COMPILE_DEBUG_LEVEL_MINIMAL, which keeps only the line information needed for profiling, or
OPTIX_COMPILE_DEBUG_LEVEL_NONE to remove even that.
Never profile compute kernels built as debug! That completely changes the code structure and does not represent the fully optimized code.
Also important for optimal performance is having --use_fast_math enabled when compiling the CUDA source code to the PTX input.
Please read the OptiX Programming Guide chapter 6.1, Program Input, for the other nvcc command line options that should be used.

I would recommend adding functionality to your application which benchmarks the time per OptiX launch. Note that all OptiX API calls which take a CUDA stream argument are asynchronous, so that stream needs to be synchronized to get the correct runtime for a benchmark.
If you then also add functionality to count the number of rays shot per OptiX launch, you can determine whether you’re getting better or worse results when changing code. (For more precise timings, run the benchmark with the ray counting code disabled.)
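
A minimal host-side sketch of such a launch timer, assuming the pipeline, stream, SBT, and device-side launch parameters already exist (all names here are placeholders):

    #include <cuda_runtime.h>
    #include <optix.h>

    // Time a single optixLaunch with CUDA events recorded on the same stream.
    // optixLaunch is asynchronous, so the stop event must be synchronized before
    // the elapsed time (or the rendered image) is read back.
    float timedLaunch(OptixPipeline pipeline, CUstream stream,
                      CUdeviceptr d_params, size_t paramsSize,
                      const OptixShaderBindingTable& sbt,
                      unsigned int width, unsigned int height)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, stream);
        optixLaunch(pipeline, stream, d_params, paramsSize, &sbt, width, height, 1);
        cudaEventRecord(stop, stream);

        cudaEventSynchronize(stop); // waits until the launch has finished
        float milliseconds = 0.0f;
        cudaEventElapsedTime(&milliseconds, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return milliseconds;
    }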

There are many factors which can influence the performance of an OptiX ray tracer. The more coherent the rays are, the better the per-warp occupancy and the better the performance. That means divergent rays, like those in a path tracer after the first bounce, will naturally show lower performance than is theoretically possible.

It’s also possible to underutilize the GPU with launches that are too small, and what counts as too small depends on the number of compute cores.

So for your current code, I would say leave it as it is and don’t try to optimize it with Nsight Compute unless you see that the performance of an OptiX launch for a given number of rays is not improving.

Some general coding strategies are:

  • Inlined code will be faster than callables.
  • Try reducing the number of optixTrace calls inside your device code.
  • Try moving the optixTrace outside of deeply nested code.
  • Reduce the number of live variables.
  • Prefer vectorized loads, that is, use 2- and 4-component vector types when possible. 3-component vector types are loaded and stored as three scalars by CUDA. (See the sketch after this list.)
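
To illustrate that last point, a small device-side sketch (struct and function names are hypothetical):

    // A float4 has 16-byte alignment and can be read with a single 128-bit vector load,
    // while a float3 element is read as three separate scalar loads.
    struct Vertex3 { float3 position; };   // loaded as three scalar loads
    struct Vertex4 { float4 position; };   // loaded as one vector load

    static __forceinline__ __device__ float3 loadPosition(const Vertex4* vertices, unsigned int i)
    {
        const float4 p = vertices[i].position; // single 128-bit load
        return make_float3(p.x, p.y, p.z);
    }
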
Other optimizations would be around data structures and the overall scene structure. You can also easily build inefficient acceleration structures, for example where AABBs are badly nested or overlapping.

This is helpful. Thanks.
When I build my acceleration structure, the top level of my structure is an OptixBuildInput that refers to an array of OptixInstance structures where OptixInstance::transform does not specify any transformation. Each of these OptixInstance structures refers to a chain of OptixStaticTransform structures that apply scale, rotate, and translate transforms, and the last of the OptixStaticTransform structures refers to the OptixBuildInput structure of type OPTIX_BUILD_INPUT_TYPE_TRIANGLES that holds the vertex array for that mesh.

Is it faster for acceleration structure build time and for pipeline launch time to just have the OptixInstance structure have a single transform array where I have computed the resulting matrix from applying all the transforms or to have a set of OptixStaticTransform that apply each transform individually?

Is it faster for acceleration structure build time and for pipeline launch time to just have the OptixInstance structure have a single transform array where I have computed the resulting matrix from applying all the transforms

Absolutely! The matrix transform inside the instance is essentially free on GPUs with RT cores.
Also, the BVH traversal itself is different for single-level and multi-level hierarchies.

Each of your additional transforms would be a level inside the transform list, and you would need to manually concatenate the resulting transform inside the closest-hit programs every time, whereas with a single-level hierarchy you can just pick that single instance matrix directly.

The code difference can be seen in my example code below, where the first knows that the scene has a single-level hierarchy (IAS → GAS, which OptiX describes with OptixPipelineCompileOptions.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING) and uses its own hardcoded getters, and the second has a motion transform in between which requires walking through the transform list. You can look at the helper functions provided inside the OptiX SDK to see how involved the latter is for transforms and their inverses.
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_driver/shaders/closesthit.cu#L153
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_motion_blur/shaders/closesthit.cu#L72

Keep your scene hierarchy as flat as possible. There are cases where it’s required to have multiple IAS, for example when the scene complexity gets too large, or when additional transforms cannot be avoided, like for motion blur.
In any case, always make use of the instance transform if you can to keep the transform list as short as possible.
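
For illustration, a hedged host-side sketch of baking scale, rotation (here around the Y axis), and translation into the single row-major 3x4 OptixInstance::transform instead of chaining OptixStaticTransform nodes (helper and variable names are made up):

    #include <cuda_runtime.h>
    #include <optix.h>
    #include <cmath>
    #include <cstring>

    // Fill OptixInstance::transform with M = T * R_y * S, stored as the upper 3x4
    // of a 4x4 matrix in row-major order (object-to-world).
    static void setInstanceTransform(OptixInstance& instance,
                                     float sx, float sy, float sz, // scale
                                     float angleY,                 // rotation around Y, in radians
                                     float tx, float ty, float tz) // translation
    {
        const float c = std::cos(angleY);
        const float s = std::sin(angleY);

        const float m[12] = {
             c * sx, 0.0f,  s * sz, tx,
             0.0f,   sy,    0.0f,   ty,
            -s * sx, 0.0f,  c * sz, tz
        };
        std::memcpy(instance.transform, m, sizeof(m));
    }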

