Some questions about rays

I have some questions about OptiX that I don't quite understand. Can each ray intersect at most one triangle? (In my code I found that each ray only intersects one triangle; maybe I wrote something wrong.) Is there a limit to the coordinates of the triangle vertices? What requirements does the size of the triangle primitive need to meet to ensure that it can be detected by a ray? (Sometimes the ray fails to intersect the triangle, and when I increase the size of the triangle it succeeds. This happens when the number of triangles is relatively large, such as 57708624.) Thanks!


No, each ray can intersect more than one triangle, but whether that happens depends on how your application is implemented. Below is the longer explanation.
Note that all of this information can be found in the OptiX Programming Guide.

The acceleration structure (AS) traversal is started with an optixTrace call. That defines the current ray which is checked for intersections with the primitives inside the AS.

What happens with the current ray on intersections with the geometric primitives inside the AS depends on how the pipeline and the shader binding table (SBT) are set up:

The entry point of a ray tracing pipeline is always the ray generation program.
It shoots the primary ray with an optixTrace call, and the SBT arguments of that call control which ray type to use by selecting the SBT index according to the formula at the beginning of the OptiX Programming Guide chapter 7.3, Acceleration Structures. That formula is crucial for understanding how SBTs work.
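For orientation, the gist of that formula (paraphrased here, with purely illustrative identifier names) is:

// Paraphrased SBT hit record index selection; identifier names are illustrative only.
// sbtIndex = sbtInstanceOffset                  // OptixInstance::sbtOffset (0 when tracing a single GAS)
//          + sbtGASIndex * sbtStrideFromTrace   // per-build-input SBT index times the SBTstride argument of optixTrace
//          + sbtOffsetFromTrace;                // the SBToffset argument of optixTrace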

The hit record inside the SBT can reference three program domains: intersection, anyhit, and closesthit. Which of them are needed and used depends on the application's intent (see the explanation above).

Built-in triangles have an implicit intersection program.
All other geometric primitives in OptiX must set an intersection program: the ones for built-in curves and spheres need to be queried from OptiX using optixBuiltinISModuleGet, and only custom geometric primitives must implement their own intersection program.
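For example, querying the built-in sphere intersection module could look roughly like this (a sketch for a recent OptiX 7 version; the module and pipeline compile options are assumed to exist elsewhere, and error checking is omitted):

OptixBuiltinISOptions builtinISOptions = {};
builtinISOptions.builtinISModuleType = OPTIX_PRIMITIVE_TYPE_SPHERE; // or one of the built-in curve types
builtinISOptions.usesMotionBlur      = 0;

OptixModule sphereISModule = nullptr;
optixBuiltinISModuleGet(optixContext,
                        &moduleCompileOptions,   // your OptixModuleCompileOptions
                        &pipelineCompileOptions, // your OptixPipelineCompileOptions
                        &builtinISOptions,
                        &sphereISModule);
// sphereISModule is then set as moduleIS inside the hit group's OptixProgramGroupDesc.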

Now, the frequency with which the different OptiX program domains are called differs.

The intersection program is called for any axis-aligned bounding box (AABB) over a geometric primitive which the ray intersects during AS traversal. That means it's the most frequently called program and should be implemented to be as performant as possible!
It's called in AS traversal order, not in ray direction order!
Usually the intersection program reduces the ray's t_max value of the interval [t_min, t_max] defined by the optixTrace call whenever a closer intersection is found, until the closest hit is reached. The intersection distance of that closest hit is the t_max value queried with the device function optixGetRayTmax.
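For custom primitives, the reporting side of this mechanism is optixReportIntersection. A minimal sketch, assuming a unit sphere at the object space origin purely for illustration (program name, hit kind, and the sphere math are not from this thread):

extern "C" __global__ void __intersection__sphere_custom()
{
  const float3 orig = optixGetObjectRayOrigin();
  const float3 dir  = optixGetObjectRayDirection(); // assumed to be normalized

  // Ray/unit-sphere intersection: |orig + t * dir|^2 = 1.
  const float b    = orig.x * dir.x + orig.y * dir.y + orig.z * dir.z;           // dot(orig, dir)
  const float c    = orig.x * orig.x + orig.y * orig.y + orig.z * orig.z - 1.0f; // dot(orig, orig) - r^2
  const float disc = b * b - c;

  if (disc >= 0.0f)
  {
    const float t = -b - sqrtf(disc); // nearer root

    // Only report hits inside the current (t_min, t_max) interval.
    // OptiX shrinks t_max automatically when the potential hit is accepted (not ignored in anyhit).
    if (optixGetRayTmin() < t && t < optixGetRayTmax())
    {
      optixReportIntersection(t, 0 /* user-defined hit kind */);
    }
  }
}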

Then, depending on the intersection with the actual geometric primitive (triangle, curve, sphere, custom), the anyhit program is called when it is defined inside the SBT hit record.
That means the anyhit program can be called multiple times per ray.
It can even be called multiple times per primitive if that primitive was split into multiple AABBs during the AS build for performance optimizations. If you want to count primitives inside the anyhit program, you must set the OptixGeometryFlags value OPTIX_GEOMETRY_FLAG_REQUIRE_SINGLE_ANYHIT_CALL.

Now, if there is an anyhit program inside the SBT hit record, it can influence what happens with the currently traversed ray via the device call optixTerminateRay, after which the associated closesthit program is called. Mind that this is not necessarily the closest hit then.
This is one way to implement shadow/visibility ray types which only need to check whether any geometry lies inside the tested ray interval.
A similar but faster mechanism to achieve this (recommended) is the ray flag OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT, which causes the very first hit that is not ignored in anyhit to abort further traversal, defining it as the closest hit. This is faster than calling into an anyhit program, and the shadow ray implementation then only requires a miss program.
Demonstrated in this example: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo10/shaders/brdf_diffuse.cu#L188
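A minimal sketch of such a visibility test (the payload convention, program names, and the RAY_TYPE_SHADOW/NUM_RAY_TYPES/sceneEpsilon/surfacePosition/directionToLight/distanceToLight identifiers are my assumptions, not taken from the linked example):

// Payload 0: 0 = occluded, 1 = visible (assumed convention).
unsigned int isVisible = 0;

optixTrace(params.handle,
           surfacePosition, directionToLight,
           sceneEpsilon, distanceToLight - sceneEpsilon, 0.0f, // tmin, tmax, ray time
           OptixVisibilityMask(255),
           OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT | OPTIX_RAY_FLAG_DISABLE_ANYHIT | OPTIX_RAY_FLAG_DISABLE_CLOSESTHIT,
           RAY_TYPE_SHADOW, NUM_RAY_TYPES, RAY_TYPE_SHADOW, // assumed SBT offset, stride, miss index
           isVisible);

extern "C" __global__ void __miss__shadow()
{
  optixSetPayload_0(1); // Nothing was hit inside the tested interval: visible.
}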

The other device function which influences what happens with the current ray traversal is optixIgnoreIntersection. It causes the current potential intersection to be discarded, so t_max is not changed and the ray traversal continues.

If you reach the closesthit or miss program, the current ray is done, meaning the acceleration structure (bounding volume hierarchy) traversal started with the most recent optixTrace call is finished and nothing more is intersected by that ray.

Is there a limit to the coordinates of the triangle vertices? What requirements does the size of the triangle primitive need to meet to ensure that it can be detected by a ray? (Sometimes the ray fails to intersect the triangle, and when I increase the size of the triangle it succeeds.)

That cannot be answered without knowing the absolute values of the triangle and ray data in your case.
The OptiX ray is defined in 32-bit floating point precision. This is most likely a floating point precision issue.
The effective floating point precision depends on the range of your values. Please search the web for "IEEE 754 floating point number precision" and you'll find sites which explain which value range offers what precision.
The ray-triangle intersection routine itself is watertight in OptiX, meaning that when shooting rays at the common edge of two adjacent triangles, the ray hits exactly one of them.

In general, it's a good idea to keep your world size extents reasonably small and to limit your ray t_max to that world size.


Thank you for your reply. The problem I am facing: there are many triangles along the direction of my ray. I want to record all the triangles detected in the anyhit program, but when I call optixTrace() once it only intersects one triangle; then the closesthit program is called and the ray is done. If I want to continue detecting triangles, I have to move ray_origin and call optixTrace() again until I have traversed all triangles. If multiple triangles can be detected in a single optixTrace() call, how should I modify the code, and is there any performance difference between that method and the one I use?

OptixAccelBuildOptions accel_options={};
    accel_options.buildFlags=OPTIX_BUILD_FLAG_NONE;//
    accel_options.operation=OPTIX_BUILD_OPERATION_BUILD;

    const uint32_t triangle_input_flags[1]={ OPTIX_GEOMETRY_FLAG_REQUIRE_SINGLE_ANYHIT_CALL };
    OptixBuildInput triangle_input={};
    triangle_input.type=OPTIX_BUILD_INPUT_TYPE_TRIANGLES;

    primitive_vertices_buffer.alloc_and_upload(primitive_vertices_);
    CUdeviceptr d_vertices=primitive_vertices_buffer.d_pointer();

    triangle_input.triangleArray.vertexBuffers=&d_vertices;//CUdeviceptr*
    triangle_input.triangleArray.numVertices=(primitive_vertices_.size());
    triangle_input.triangleArray.vertexFormat=OPTIX_VERTEX_FORMAT_FLOAT3;
    triangle_input.triangleArray.vertexStrideInBytes=sizeof(float3);

    // triangle_input.triangleArray.indexBuffer=nullptr;
    triangle_input.triangleArray.numIndexTriplets=0;
    triangle_input.triangleArray.indexFormat=OPTIX_INDICES_FORMAT_NONE;
    // triangle_input.triangleArray.indexStrideInBytes=

    triangle_input.triangleArray.flags=triangle_input_flags;
    triangle_input.triangleArray.numSbtRecords=1;
    triangle_input.triangleArray.sbtIndexOffsetBuffer=0;
    triangle_input.triangleArray.sbtIndexOffsetSizeInBytes=0;//
    triangle_input.triangleArray.sbtIndexOffsetStrideInBytes=0;

    OptixAccelBufferSizes gas_buffer_sizes;
    optixAccelComputeMemoryUsage(
        optixContext,
        &accel_options,
        &triangle_input,//buildInputs: an array of OptixBuildInput objects
        1,//number of elements in buildInputs(>=1)
        &gas_buffer_sizes
    );

    CUdeviceptr d_temp_buffer_gas;
    cudaMalloc(
        reinterpret_cast<void**>(&d_temp_buffer_gas),
        gas_buffer_sizes.tempSizeInBytes
    );
    CUdeviceptr d_gas_output_buffer;
    cudaMalloc(
        reinterpret_cast<void**>(&d_gas_output_buffer),
        gas_buffer_sizes.outputSizeInBytes
    );

    optixAccelBuild(
        optixContext,
        stream,//CUDA stream
        &accel_options,
        &triangle_input,
        1,//numBuildInputs
        d_temp_buffer_gas,
        gas_buffer_sizes.tempSizeInBytes,
        d_gas_output_buffer,
        gas_buffer_sizes.outputSizeInBytes,
        &gas_handle,
        nullptr,            // emitted property list
        0
    );

extern "C" __device__ void get_neighbors(float3 ray_origin,float3 ray_direction=make_float3(0.0,1.0,0.0)){
    // bool is_found=false;
    uint32_t num=0;
    while(true){
        uint32_t p0=0;
        // printf("optixTrace...\n");
        optixTrace(
            params.handle,
            ray_origin,
            ray_direction,
            0.0f,
            100.0f,
            0.0f,
            OptixVisibilityMask(255),
            OPTIX_RAY_FLAG_NONE,
            0,
            0,
            0,
            p0
        );
        // printf("p0 = %u\n",p0);
        if(p0==0) break;
        else{
            ray_origin.y+=1.0;
            num++;
            // is_found=true;
            // printf("ray trace once done...\n");
        }
    }
    // printf("num = %u\n",num);
    // return is_found;
}

First of all, why is your current code stepping along a fixed ray_origin.y increment of 1.0 units? (Also avoid doubles; use 1.0f instead.)
Stepping with a fixed increment doesn't make sense when you want to gather all intersections along a ray and that number can vary.

After shooting the primary ray with a zero offset, the continuation rays after a triangle intersection should step by the intersection distance plus some scene-dependent epsilon to avoid self-intersections with the hit triangle.
For that, return the positive floating point intersection distance you query with optixGetRayTmax() inside the closesthit program via the payload register, and return -1.0f from the miss program.
That automatically skips all empty space between triangle intersections along the ray direction and only shoots as many rays as there are intersections along that ray, until you reach the miss program.
Something like this:

const float intersectionDistance = __uint_as_float(p0);
if (intersectionDistance >= 0.0f) // Hit?
{
  ray_origin.y += intersectionDistance + sceneEpsilon; // Skip empty space and the hit triangle.
  // For arbitrary directions actually: ray_origin += ray_direction * (intersectionDistance + sceneEpsilon);
  // This might need some vector operator overloads you can find for example in https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/rtigo10/shaders/vector_math.h
  ++num;
}
else // Negative intersectionDistance means miss program reached: End of path.
{
  break;
}
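For completeness, the matching closesthit and miss programs could look roughly like this (the program names and the single-payload convention are assumptions):

extern "C" __global__ void __closesthit__ch()
{
  // Return the positive intersection distance of the accepted closest hit in payload 0.
  optixSetPayload_0(__float_as_uint(optixGetRayTmax()));
}

extern "C" __global__ void __miss__ms()
{
  // A negative distance signals: no more triangles along the ray.
  optixSetPayload_0(__float_as_uint(-1.0f));
}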

You could also just keep the ray origin and direction intact and only increase the ray tmin value while keeping the ray tmax the same. Just make sure tmin is always less than tmax, otherwise you'll get invalid ray exceptions.
This would generate smaller ray intervals with each step, which should speed up the BVH traversal.
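A sketch of that variant, using the same payload convention as above (sceneEpsilon and the fixed t_max of 100.0f are assumptions matching your code):

float tMin = 0.0f;
const float tMax = 100.0f;
unsigned int num = 0;

while (tMin < tMax)
{
  unsigned int p0 = __float_as_uint(-1.0f);

  optixTrace(params.handle,
             ray_origin, ray_direction, // unchanged for the whole loop
             tMin, tMax, 0.0f,
             OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
             0, 0, 0,
             p0);

  const float hitT = __uint_as_float(p0);
  if (hitT < 0.0f)
  {
    break; // Miss: no more intersections inside [tMin, tMax].
  }

  tMin = hitT + sceneEpsilon; // Continue just behind the last accepted hit.
  ++num;
}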

Also, that get_neighbors() function is called by your ray generation (__raygen__) program?
Then it should be put into the same module as that program and not use extern "C".
__forceinline__ __device__ void get_neighbors would be enough then.

Please get this working perfectly before trying other solutions.

The BVH traversal and ray-triangle intersection would be fully hardware accelerated by the RT cores on RTX boards. Current high-end boards can reach well over 10 GRays/second doing that, which would normally only be achieved with comparably simple scenes and user device programs which do very few memory accesses.
The maximum rays/second performance is usually limited by the memory bandwidth on RTX boards, so you should first check whether this iterative approach is good enough when done correctly.

As explained above,
The other device function which influences what happens with the current ray traversal is optixIgnoreIntersection. It causes the current potential intersection to be discarded, so t_max is not changed and the ray traversal continues.

That means if you want to gather all intersections along a ray inside an anyhit program, you would need to store the intersection data you need inside the anyhit program and call optixIgnoreIntersection at the end.
That way the closesthit program is never called and the ray ends when reaching the miss program.
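A hedged sketch of such an anyhit program (the per-ray hit buffer, its entry layout, params.maxHitsPerRay, and the payload convention are assumptions):

extern "C" __global__ void __anyhit__gather()
{
  // Assumed: payload 0 carries the number of hits recorded so far, and
  // params.hitBuffer has room for params.maxHitsPerRay entries per launch index (1D launch assumed).
  const unsigned int launchIndex = optixGetLaunchIndex().x;
  const unsigned int hitCount    = optixGetPayload_0();

  if (hitCount < params.maxHitsPerRay)
  {
    const unsigned int slot = launchIndex * params.maxHitsPerRay + hitCount;
    params.hitBuffer[slot].primitiveIndex = optixGetPrimitiveIndex();
    params.hitBuffer[slot].t              = optixGetRayTmax(); // distance of this potential hit
    optixSetPayload_0(hitCount + 1);
  }

  optixIgnoreIntersection(); // Keep t_max unchanged and continue traversal.
}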

The problem with this approach is that the anyhit program interrupts the hardware BVH traversal on RTX boards and calls back into your anyhit program for each potential intersection.
Additionally, intersection and anyhit programs are not called in ray direction order but in BVH traversal order, so you would need to sort your gathered intersections by intersection distance if you need them in ray direction order.
The iterative approach above could instead store each hit into a global buffer inside the ray generation program.

If you actually need the intersection information (distance, triangle ID, etc.) and not only the number of intersections along the ray, then you need room for all intersection results per ray.
That's either a buffer allocated a priori and written for each intersection inside the ray generation program with the iterative approach above, or some per-ray payload structure which is provided as a 64-bit pointer in two 32-bit payload registers. Both might result in rather large memory requirements depending on your launch dimensions and the maximum possible number of intersections per ray.
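The per-ray payload structure variant usually splits the 64-bit pointer across two 32-bit payload registers, roughly like this (the helper and struct names are illustrative):

// Pack/unpack a 64-bit pointer into two 32-bit payload registers.
static __forceinline__ __device__ void packPointer(const void* ptr, unsigned int& i0, unsigned int& i1)
{
  const unsigned long long uptr = reinterpret_cast<unsigned long long>(ptr);
  i0 = static_cast<unsigned int>(uptr >> 32);
  i1 = static_cast<unsigned int>(uptr & 0x00000000FFFFFFFFull);
}

static __forceinline__ __device__ void* unpackPointer(unsigned int i0, unsigned int i1)
{
  const unsigned long long uptr = (static_cast<unsigned long long>(i0) << 32) | i1;
  return reinterpret_cast<void*>(uptr);
}

// Inside the ray generation program (PerRayData is an illustrative struct):
//   PerRayData prd; unsigned int p0, p1; packPointer(&prd, p0, p1);
//   optixTrace(..., p0, p1);
// Inside anyhit/closesthit/miss:
//   PerRayData* prd = reinterpret_cast<PerRayData*>(unpackPointer(optixGetPayload_0(), optixGetPayload_1()));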

This has been discussed multiple times on this forum already.
Please have a look especially at these threads:
https://forums.developer.nvidia.com/t/ray-mesh-intersections/170090/5
https://forums.developer.nvidia.com/t/whats-your-solution-to-get-all-hit-primitives-of-multiple-rays/239528/2
and at all forum threads found by the search link in this post:
https://forums.developer.nvidia.com/t/can-i-not-enter-the-miss-shader/246130/2

Hello! I have some other questions:

  1. In OptiX, can the coordinates of a triangle be negative, or can the ray_origin contain a negative number?
  2. When ray_direction contains negative numbers, such as ray_direction = (0, -1, 0), can tmin be greater than tmax in optixTrace()?
  3. Is there any good way to handle the 32-bit floating point precision problem?
  4. How is the intersection of a ray and a triangle calculated?

Thank you for your reply.

  1. Yes, triangle positions and ray origin are just float3 values.

  2. The ray direction is also just a float3. It should be set to a normalized 3D vector, but it can point in any direction.
    ray.tmin must always be non-negative and less than (not equal to) ray.tmax, i.e. 0.0f <= tmin < tmax. Everything else raises an invalid ray exception.
    The interval [tmin, tmax] checked for intersections lies along the ray direction.

  3. That depends on what you're doing. The 23-bit mantissa of the IEEE 754 32-bit floating point format limits the available precision.
    Make sure your scene does not exceed those limits; try keeping everything under ~8 million units. The best precision is within [-1.0, 1.0].
    Pick a bigger world unit, scale your scene down, use geometry object coordinates around the origin, and place your objects into the scene with instance matrix transformations.

  4. I'm not sure what you're asking for. RTX boards do that in hardware using the specialized RT cores.
    You can find software triangle intersection routines in the older OptiX SDK versions (before 7.0.0), but these are not watertight.
    For example, install OptiX SDK 6.5.0 and have a look at the function intersect_triangle inside the header optixu_math_namespace.h.
    Of course, implementing triangles as custom primitives with such user-defined intersection programs will be much slower than letting the RT cores do that in hardware.
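For reference, a minimal Möller-Trumbore style routine shows the basic math such a software intersection computes; this is neither the SDK implementation nor what the RT cores do, and it is not watertight:

// Minimal Moller-Trumbore ray/triangle test (not watertight, for illustration only).
// Returns true and writes the distance t when the ray (orig, dir) hits triangle (v0, v1, v2).
static __forceinline__ __device__ float3 f3_sub(const float3 a, const float3 b)
{
  return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}
static __forceinline__ __device__ float3 f3_cross(const float3 a, const float3 b)
{
  return make_float3(a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x);
}
static __forceinline__ __device__ float f3_dot(const float3 a, const float3 b)
{
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

static __forceinline__ __device__ bool intersectTriangle(const float3 orig, const float3 dir,
                                                         const float3 v0, const float3 v1, const float3 v2,
                                                         float& t)
{
  const float3 e1   = f3_sub(v1, v0);
  const float3 e2   = f3_sub(v2, v0);
  const float3 pvec = f3_cross(dir, e2);
  const float  det  = f3_dot(e1, pvec);

  if (fabsf(det) < 1.0e-8f) // Ray (nearly) parallel to the triangle plane.
    return false;

  const float  invDet = 1.0f / det;
  const float3 tvec   = f3_sub(orig, v0);
  const float  u      = f3_dot(tvec, pvec) * invDet; // first barycentric coordinate
  if (u < 0.0f || u > 1.0f)
    return false;

  const float3 qvec = f3_cross(tvec, e1);
  const float  v    = f3_dot(dir, qvec) * invDet;    // second barycentric coordinate
  if (v < 0.0f || u + v > 1.0f)
    return false;

  t = f3_dot(e2, qvec) * invDet; // distance along the ray
  return t > 0.0f;
}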

Hello!
When I call optixLaunch() repeatedly in a loop, I find that the execution time of the first call is much longer than the execution time of the subsequent calls, even if the number of threads set in the first launch is smaller than in the later ones (other parameters remain unchanged). Is this reasonable?
Thank you for your reply!

Please note that questions about benchmark results require absolute numbers, a description of exactly how the results were measured, and the system configuration information.

optixLaunch calls, like all OptiX API entry points which take a CUDA stream argument, are asynchronous.
Depending on how you measured that, you might have measured something else still running on the device, like an acceleration structure build.

Please read this thread: https://forums.developer.nvidia.com/t/two-gas-with-no-ias/245993/2

Sorry, the time is measured with the code below:

cudaEvent_t bfs_start,bfs_end;
float rt_time;
cudaEventCreate(&bfs_start);
cudaEventCreate(&bfs_end);
cudaEventRecord(bfs_start,stream);
optixLaunch(
    pipeline,
    stream,
    reinterpret_cast<CUdeviceptr>(d_params),
    sizeof(Params),
    &sbt,
    1,
    1,
    1
);
cudaEventRecord(bfs_end,stream);
cudaEventSynchronize(bfs_end);
cudaEventElapsedTime(&rt_time,bfs_start,bfs_end);

After the ray is emitted, the only thing to do is to intersect with the triangles and record the number of triangles.

unsigned int p0=0;
optixTrace(
        params.handle,
        ray_origin,
        ray_direction,
        params.curr_tmin,
        params.curr_tmax,
        0.0f,
        OptixVisibilityMask(255),
        OPTIX_RAY_FLAG_NONE,
        0,
        0,
        0,
        p0
    );

extern "C" __global__ void __anyhit__ah(){
    const unsigned int primitive_id=optixGetPrimitiveIndex();
    unsigned int point_count=optixGetPayload_0();
    optixSetPayload_0(point_count+1);
    optixIgnoreIntersection();
}

Below are the measurement results; on the left is the number of intersecting triangles, on the right the time in milliseconds (from cudaEventElapsedTime).

1: 4.54554
2: 0.028672
3: 0.034656
4: 0.018432
5: 0.018432
6: 0.019296
7: 0.019392
8: 0.019456
9: 0.021376
10: 0.021504

OS version: Ubuntu 20.04.4, GPU: RTX 3090, Driver Version: 515.105.01, CUDA Version: 11.4.

I’m unable to reproduce this under Windows 10.

I added the same event record calls into my intro_runtime example from https://github.com/NVIDIA/OptiX_Apps and added a local loop over the optixLaunch measurements, and the first call is not taking longer.

Could you add a cudaStreamSynchronize(stream); before your cudaEventCreate call and see if the behavior changes?
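For example (a sketch based on your timing code above; cudaEventElapsedTime reports milliseconds):

cudaStreamSynchronize(stream); // Make sure nothing else, e.g. the AS build, is still running on this stream.

cudaEvent_t bfs_start, bfs_end;
float rt_time;
cudaEventCreate(&bfs_start);
cudaEventCreate(&bfs_end);
cudaEventRecord(bfs_start, stream);
// ... optixLaunch as before ...
cudaEventRecord(bfs_end, stream);
cudaEventSynchronize(bfs_end);
cudaEventElapsedTime(&rt_time, bfs_start, bfs_end); // milliseconds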

Also note that it's recommended to check every CUDA API call for errors in case something goes wrong.
See the respective CUDA_CHECK and CU_CHECK macros for the CUDA runtime and driver APIs which I'm using in the examples linked above.
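A minimal version of such an error check macro for the CUDA runtime API could look like this (the macros in the linked examples differ in detail):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hedged sketch of a CUDA runtime API error check macro (not the exact macro from the examples).
#define CUDA_CHECK(call)                                                  \
  do                                                                      \
  {                                                                       \
    const cudaError_t error = (call);                                     \
    if (error != cudaSuccess)                                             \
    {                                                                     \
      fprintf(stderr, "CUDA error %s at %s:%d: %s\n",                     \
              cudaGetErrorName(error), __FILE__, __LINE__,                \
              cudaGetErrorString(error));                                 \
      exit(EXIT_FAILURE);                                                 \
    }                                                                     \
  } while (false)

// Usage example: CUDA_CHECK(cudaEventRecord(bfs_start, stream));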

You could also analyze your application performance behavior with Nsight Systems: https://developer.nvidia.com/nsight-systems

It might also be worth trying to synchronize between the last cudaEventCreate and the first cudaEventRecord. Something is definitely going wrong with the first launch and/or measurement.

Two other thoughts:

When timing, I always lock my graphics clocks to a value somewhere below the maximum (such as 75%) to avoid the possibility of thermal throttling. This test is so small it won't heat up the GPU, but there have occasionally been times when my clocks seemed slow to spin up from idle as well, so locking your clock before you run your test may rule out such issues. You can lock your graphics clock using nvidia-smi -lgc <clock>, and you can unlock it when done using nvidia-smi -rgc. nvidia-smi can also be used to query your maximum and acceptable clock values. You could also consider querying your clock value in code before the launch, if you want to check whether low clocks might be implicated here.

Also please note that with a 1-thread launch and a single-digit number of triangles, you are only measuring pure OptiX launch overhead. Maybe that's what you're after, but if you really want to make a shmoo plot of intersection time for rays vs. triangles, you'll need much, much higher counts of both to see any meaningful differences (i.e., I'd guess the difference between launches 8 and 9 is measurement noise and has nothing to do with having one more triangle). This is just to say, you're not really measuring the time to cast a ray at all. In order to measure per-ray and per-triangle times, you need a lot of them.

In any case, the ~19 microseconds you're seeing after the first few launches is in the neighborhood of what's expected for launch overhead, depending on clocks, GPU model, and other things, and this time is generally consumed entirely by launch setup, e.g. processing of your launch params and the host-side API calls involved. This might help you think of reasons the first launch can be unexpectedly slow: the OptiX launch setup does some device memory operations, which means it can stall if other memory operations are already in progress on the GPU at the same time. Those could include memory operations you initiated before your OptiX launch, memory operations the GPU is doing to service your display (if you're timing things on your display GPU), or memory operations initiated by your operating system or other applications using the GPU. If you have a spare GPU you can use to drive your display (or a separate machine to run the test remotely over ssh or similar), and you can run this test on the 3090 without a monitor connected, that might also yield different timings or behavior.


David.