Benchmarking: count actual hit rays, and relative hit rays


In an interactive rendering, e.g., the optixPathTracer example, while moving the scene with the mouse, we see some random black pixels that are supposed to be filled with color (just like in this figure).

[Screenshot from 2022-07-21 14-29-17]

These black pixels, however, gradually converge to colored pixels once the interaction stops. Please correct me if I am wrong: my assumption is that, due to the motion, after the primary bounce the secondary rays very probably cannot reach the light source.

Suppose, in an ideal situation (no miss rays, only closest hits), I have 10*10 pixels to render, and I am using 4 samples per pixel, restricted to 2 bounces. So, in the scene I will have in total 4*10*10 = 400 paths, and (4*10*10)+(4*10*10)+(4*10*10) = 1200 rays. But due to the motion, as I am seeing some black pixels, I assume I will get fewer rays than the original number.

I guess there is a way in OptiX to count these numbers (the absolute hit count and the relative hit count), but I cannot figure out how to do that. Could you please give some suggestions?

I do something like this, which avoids any need for locking when counts are incremented:

  • Use cudaMalloc to allocate an array of int or short, as needed, for each unique ray type. The size is width * height * sizeof(counter type).
  • Use cudaMemset to zero the array.
  • Pass the address of the array as a field in the launch parameter block (the one named in the OptiX pipeline compile options).
  • In each hit program or miss program I have a code block like this:
#ifdef TIMING
    const uint3 dim = optixGetLaunchDimensions();
    const uint3 idx = optixGetLaunchIndex();
    const int rayIndex = idx.y * dim.x + idx.x;
    // Each launch index writes only to its own slot, so no atomics are needed.
    runtimeParms.profileCounters[RAY_MISS_COUNTER][rayIndex] = runtimeParms.profileCounters[RAY_MISS_COUNTER][rayIndex] + 1;
#endif
  • After the host program returns from the optixLaunch function call, I use cudaMemcpy to copy each of the count arrays back to the host.
  • Then for each array, the host program iterates over the array elements and accumulates a total count.
  • If I need to time the optixLaunch function call, I use cudaEventRecord, cudaEventSynchronize and cudaEventElapsedTime in the host program.
  • Call cudaFree to free the counter arrays that were allocated.
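
The steps above can be sketched on the CPU as follows. This is only a stand-in, not OptiX code: a std::vector plays the role of the cudaMalloc'd array, and makeCounters, recordMiss and sumCounters are hypothetical names chosen for illustration. It shows why no locking is needed: each launch index only ever touches its own slot, and the totals are formed on the host afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the device-side counter array:
// one slot per launch index, zero-initialized (the cudaMemset step).
std::vector<std::uint32_t> makeCounters(std::size_t width, std::size_t height) {
    return std::vector<std::uint32_t>(width * height, 0u);
}

// What a miss program would do: bump the slot that belongs to this launch index.
// No other launch index writes to this slot, so a plain increment is enough.
void recordMiss(std::vector<std::uint32_t>& counters,
                std::size_t width, std::size_t x, std::size_t y) {
    const std::size_t rayIndex = y * width + x;
    assert(rayIndex < counters.size());
    counters[rayIndex] += 1;
}

// Host-side reduction after the counters have been copied back (the cudaMemcpy step).
std::uint64_t sumCounters(const std::vector<std::uint32_t>& counters) {
    return std::accumulate(counters.begin(), counters.end(), std::uint64_t{0});
}
```

In the real application the increment lives in the device programs and the vector is replaced by device memory; only the indexing and the final summation carry over unchanged.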

There is a related thread about counting rays here; it is basically the same thing drwootton explained. I would just hold and increment the counters in the per-ray payload and write them once at the end of the ray generation program.

I would use += 1 to increment that counter. It’s CUDA C++ after all. :-)

But due to the motion, as I am seeing some black pixels, I assume I will get fewer rays than the original number.

The Cornell Box scene inside the optixPathTracer is open in the front. Of course not all rays will have two bounces then. All rays leaving the box through the open front will reach the miss shader which is black and if you’re not implementing direct lighting or the surface hit was in shadow during direct lighting, there will not be any additional radiance from that path. That’s what progressive Monte Carlo path tracing algorithms always look like.

Note that there was a bug in older optixPathTracer versions, handling emissions incorrectly. OptiX SDK 7.5.0 contains a fix.


@drwootton1, thank you very much for your suggestion. Due to my limited understanding of CUDA and OptiX, I am having a hard time deciphering the instructions. Let’s say, in the optixPathTracer example:

// in optixPathTracer.cpp 
unsigned profileCounters ;// declaration 
// this also goes to optixPathTracer.h

// under 
void initLaunchParams(PathTracerState& state)
CUDA_CHECK(cudaMalloc( reinterpret_cast<unsigned **> (&state.params.profileCounters), state.params.width * state.params.height * sizeof (unsigned)));
  state.params.profileCounters = {};

Did you mean something like this? What is RAY_MISS_COUNTER? Is there any code snippet?

@droettger, thank you for your explanation. Instead of the black background, I changed the color to white, and during the interaction I can see the random white pixels. So, I guess the Cornell Box is good for implementation purposes, but for testing the path tracing performance I should use a more complicated scene, right? Would it be possible to replace the vertex positions and properties with the Crytek Sponza model?

Also, didn’t I get this wrong in my question?

So, in the scene I will have in total 4*10*10 = 400 paths, and (4*10*10)+(4*10*10)+(4*10*10) = 1200 rays.

I think, as OptiX uses a single ray rendering model, through each of the pixels there will be only 1 primary ray (10x10 primary rays in total), 4x10x10 secondary reflected rays, and 4x4x10x10 tertiary reflected rays. 4 is the number of samples per pixel, and the depth is limited to 2 bounces (no miss rays).

So, in total there should be 100+400+1600 = 2100 rays for the scene and 1600 paths until these hit the light source. I am a bit confused about the terminology: samples per pixel versus rays per pixel.

Fortunately, I am already using OptiX 7.5 after its first release.

1.) To your ray counting question:

The cudaMalloc() is correct.
You allocate a device buffer of unsigned int elements, one per launch index (width * height), which should receive your ray counts.

The line state.params.profileCounters = {}; is incorrect though.
state.params.profileCounters is a pointer to CUDA device memory. You cannot initialize that to zero like you would do for host structures. Instead you need to initialize that with a CUDA function.
If you just want to count the numbers for a single optixLaunch, there is no need to initialize the data inside the profileCounter element per launch index. You can simply write the result once at the end of the ray generation program into the linear index you calculate from the launch dimension (width) and each launch index.
All OptiX examples show that for their output buffers.

If you want to count the rays over multiple progressive optixLaunch calls, you should initialize the buffer elements only once.
That could be done with a call to cudaMemset() or, with the CUDA Driver API, cuMemsetD8/D16/D32().

Alternatively, you can simply set the counter to zero at the beginning of the ray generation program for the very first progressive launch only.
There is usually a “subframe” or “iteration” index variable inside the launch parameters of progressive renderers which indicates which frame it is; it is used to initialize the random number generators differently and when accumulating the radiance results.

You then add a counter variable to your per-ray payload, increment that before every optixTrace call, and write that at the end of the ray generation program into the desired output buffer.
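
A minimal CPU sketch of that pattern, with the OptiX calls replaced by stand-ins: Payload, mockTrace and rayGen are hypothetical names, and mockTrace merely pretends that the path reaches the miss program at a depth chosen by the caller. The points it illustrates are the increment before every trace call, the single write at the end of ray generation, and the zeroing on the first subframe only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-ray payload: carries the ray counter along the path.
struct Payload {
    std::uint32_t rayCount = 0;      // number of trace calls issued so far
    bool          done     = false;  // set when the path terminates
};

// Stand-in for optixTrace plus the hit/miss programs: the path terminates
// (as if it reached the miss program) at the depth given by missAtDepth.
void mockTrace(Payload& prd, int depth, int missAtDepth) {
    if (depth == missAtDepth) prd.done = true;
}

// Stand-in for the ray generation program of one launch index.
void rayGen(std::vector<std::uint32_t>& rayCounts, std::size_t linearIndex,
            unsigned subframeIndex, int maxDepth, int missAtDepth) {
    assert(linearIndex < rayCounts.size());
    // Zero the slot on the very first progressive launch only.
    if (subframeIndex == 0) rayCounts[linearIndex] = 0;

    Payload prd;
    for (int depth = 0; depth < maxDepth && !prd.done; ++depth) {
        prd.rayCount += 1;  // increment before every trace call
        mockTrace(prd, depth, missAtDepth);
    }
    // Single write at the end, accumulating over progressive launches.
    rayCounts[linearIndex] += prd.rayCount;
}
```

Keeping the counter in the payload means the hit and miss programs never touch global memory for counting; only the ray generation program does, once per launch index.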

That the profileCounters are an array in the example code above only means there is an array of these buffers for different counters, whereas you allocated only one.
You could also allocate a buffer of structures with different counter fields for hit, miss, whatever.

If you need many different counters, check the required counter range and use smaller data types when memory usage is a concern.
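
As a rough way to check the counter range, one can compare a worst-case estimate against the maximum of the candidate type. The helpers below (worstCaseCount, counterFits) are hypothetical and assume every path uses all of its rays in every launch. For example, with 4 samples per pixel and up to 3 rays per path, one launch index sees at most 12 rays per launch, so a 16-bit counter accumulated over progressive launches would wrap somewhere after about 5461 launches.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Worst-case value of one per-launch-index counter, assuming every path
// uses its full ray budget in every launch (an upper bound, not a measurement).
constexpr std::uint64_t worstCaseCount(std::uint64_t raysPerLaunch,
                                       std::uint64_t launches) {
    return raysPerLaunch * launches;
}

// True if a counter of type T can hold the worst-case count without wrapping.
template <typename T>
constexpr bool counterFits(std::uint64_t raysPerLaunch, std::uint64_t launches) {
    return worstCaseCount(raysPerLaunch, launches) <=
           static_cast<std::uint64_t>(std::numeric_limits<T>::max());
}
```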

Since you allocate a counter entry per launch index, you can count rays or whatever per launch index.
In a progressive Monte Carlo ray tracer the number of rays is usually much higher than the number of samples per pixel. Each sample per pixel is one path with one to many rays, the number of optixTrace calls per path.
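
Applied to the 10x10 example from the question, the arithmetic can be written out as below. This is a sketch under stated assumptions: one path per sample, one continuation ray per bounce (no branching), and no shadow rays; the function names are made up for illustration. Under those assumptions the upper bound is 400 paths and 1200 rays, and any path that reaches the miss program early lowers the actual ray count.

```cpp
#include <cassert>
#include <cstdint>

// One path per sample per pixel.
constexpr std::uint64_t pathCount(std::uint64_t width, std::uint64_t height,
                                  std::uint64_t samplesPerPixel) {
    return width * height * samplesPerPixel;
}

// Upper bound on rays: each path has one primary ray plus at most one
// continuation ray per bounce. Assumes no shadow rays and no branching.
constexpr std::uint64_t maxRayCount(std::uint64_t paths, std::uint64_t maxBounces) {
    return paths * (1 + maxBounces);
}
```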

2.) To your scene question:

The optixPathTracer example is using a hardcoded scene inside a single geometry acceleration structure.
If you want to use the Crytek Sponza model you would need to replace everything regarding geometry and materials with something else. That is basically a rewrite of the host application part.
Since the Crytek Sponza model is fully textured, the shader would also need to be enhanced to handle that and you would need to add a 2D texture loading algorithm as well.

That is not too difficult when you’re completely familiar with OptiX 7, and there are examples which do that.

The older OptiX SDKs before OptiX 7 had an OBJ loader based on tinyobjloader which was used inside the optixMeshViewer SDK example.
The optixMeshViewer inside the OptiX SDK 7.x examples switched to the more capable glTF format, but searching for “tiny” inside the OptiX SDK 7.5.0, shows that the optixMotionGeometry example is still using the tinyobjloader.

My OptiX 7 examples (rtigo3 and nvlink_shared) can load the Sponza scene geometry just fine, but I did not implement an automatic handling of the MTL files (the material definition of an OBJ file). Though it’s possible to replace materials by their name inside my scene description. It’s a little manual work to write a scene description text file which handles the diffuse textures inside one of my examples (nvlink_shared can define them per material) but that should be doable. The scene_rtigo3_instances.txt scene description file there loads an OBJ file.

Then there are the OptiX 7 SIGGRAPH Course examples from Ingo Wald, and if you follow that link and scroll down the images, you’ll see the Sponza scene.
(Issues with these examples need to be raised inside that github repository alone, since these are Ingo’s private examples as you can see in their copyright message. I won’t answer questions about them here.)

Note that the Sponza scene is open at the roof, so it will have similar cases where rays reach the miss shader, like the Cornell Box scene, but yes, it’s a much more demanding benchmark scene than the hardcoded Cornell Box with its meagre 32 triangles.


Just as an example of what I meant with a scene description for the Sponza scene in my nvlink_shared program:
These files load the scene and render everything with diffuse reflection materials, some with a diffuse texture and some with cutout opacity textures. It’s not handling any of the other MTL material parameters, esp. not the bump mapping, simply because there are no shaders implemented doing bump mapping in these simple examples.

This is also not making optimal use of the changed textures in this version of the Sponza scene (I have the original as well, there all fabric is red) because the cutout opacity also resides inside the alpha channel of the diffuse texture and I load the explicit one.
The renderer is not handling the other material parameters inside the MTL file, esp. not the bump maps, and some materials could be replaced with different BRDFs (e.g., the chains could be metal with a rough GGX distribution).

The attached scene description file expects the Sponza files in a local folder named Sponza inside the executable folder but that can easily be changed to other folders.
system_nvlink_sponza.txt (6.1 KB)
scene_nvlink_sponza.txt (2.9 KB)

Call with command line: nvlink_shared.exe -s system_nvlink_sponza.txt -d scene_nvlink_sponza.txt

Note that the example renderer is not the fastest possible implementation because it’s using direct callable programs for the BxDF sampling and evaluation for maximum flexibility and least amount of shader code.
It’s faster to implement explicit closest hit programs instead.
I’m currently working on more examples which will demonstrate a different shader binding table layout.


My code looks quite different from the OptiX samples since I’m using Qt as my GUI widget library and runtime instead of OpenGL, and I’m not using the sutil functions in the samples.

Looking at the optixPathTracer example, I would issue the cudaMalloc call to allocate the counter array in initLaunchParams, as well as the cudaMemset to initialize the counter array to zero. This would be a one time call in your application.

I’m counting multiple events, including ray generation, ray hits and ray misses, as individual arrays of counters. I allocate one large array using cudaMalloc. I then treat the array in my code as a two-dimensional array where the first dimension (e.g. RAY_MISS_COUNTER) is the type of counter and the second dimension is the element index rayIndex = idx.y * dim.x + idx.x.

Since this array is allocated by cudaMalloc, it is in GPU memory, so the pointer to that array has to be set in the parameter structure passed in the optixLaunch call. Referring to optixPathTracer, this address is set in the state.d_params field in initLaunchParams.

If this is of any help, in my code I allocate and initialize the counter array once. Then my loop that calls optixLaunch a bunch of times runs, and at the end of the loop I use cudaMemcpy to copy the counter arrays back to the host, sum the counters over the individual ray indexes for each counter type, and then print my totals.

You can ignore the emit statements in the code, and any class names starting with ‘Q’, such as QString. That’s all related to the Qt runtime I’m using.

One time initialization code, where I also set up CUDA events since I’m timing the code:

void RaytraceThread::initPerfCounters(void) {
    // One contiguous allocation holds NUM_COUNTERS arrays of width * height shorts.
    countBufferSize = scene->width() * scene->height() * sizeof(short) * NUM_COUNTERS;

    cudaError_t rc = cudaMalloc(reinterpret_cast<void**>(&countBufferBase), countBufferSize);
    if (rc != cudaSuccess) {
        emit postError(QString("Unable to allocate profile counter memory: %1").arg(QString(cudaGetErrorString(rc))));
    }
    rc = cudaMemset(reinterpret_cast<void*>(countBufferBase), 0, countBufferSize);
    if (rc != cudaSuccess) {
        emit postError(QString("Unable to clear profile counter memory: %1").arg(QString(cudaGetErrorString(rc))));
    }
    // Point each counter type at its own slice of the allocation.
    // The pointer is advanced in bytes, which assumes countBuffer is a byte pointer.
    countBuffer = countBufferBase;
    for (int i = 0; i < NUM_COUNTERS; i++) {
        hostOptixParams.profileCounters[i] = reinterpret_cast<unsigned short*>(countBuffer);
        countBuffer = countBuffer + scene->height() * scene->width() * sizeof(short);
    }
}


The code that runs repeatedly in the loop, where launchPipeline issues the optixLaunch call:

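#ifdef TIMING
    // Event recording is not shown in the post; placed here per the description above.
    cudaEventRecord(startEvent);
    // Draw the current scene frame
    if (!launchPipeline()) {
        // error handling not shown in the post
    }
    cudaEventRecord(endEvent);
    cudaEventSynchronize(endEvent);

    float gpuTime;  // elapsed time in milliseconds
    cudaEventElapsedTime(&gpuTime, startEvent, endEvent);
    totalGpuTime = totalGpuTime + gpuTime;
    iterationCount = iterationCount + 1;
#else
    if (!launchPipeline()) {
        // error handling not shown in the post
    }
#endif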

The code that is called after the loop completes, copying the counter arrays back to the host, summing the individual counter arrays and printing the totals:

void RaytraceThread::dumpPerfCounters(void) {
    unsigned short *gpuCounters = new unsigned short[scene->width() * scene->height() * NUM_COUNTERS];

    cudaError_t rc = cudaMemcpy(gpuCounters, reinterpret_cast<void*>(countBufferBase), countBufferSize, cudaMemcpyDeviceToHost);
    if (rc != cudaSuccess) {
        emit postError(QString("Unable to copy GPU count buffer: %1").arg(QString(cudaGetErrorString(rc))));
    }
    unsigned long counterTotals[NUM_COUNTERS] = { };
    const char *counterNames[NUM_COUNTERS] = {
        "primary rays generated", "reflection rays", "lambert rays", "occlusion rays", "refraction rays", "ray hits", "ray misses", "shadow hits", "shadow misses", "translucent hits" };
    // The counter arrays are stored back to back, so one running index walks all of them.
    int counterIdx = 0;
    for (int i = 0; i < NUM_COUNTERS; i++) {
        for (int j = 0; j < scene->width() * scene->height(); j++) {
            counterTotals[i] = counterTotals[i] + gpuCounters[counterIdx];
            counterIdx = counterIdx + 1;
        }
    }
    for (int i = 0; i < NUM_COUNTERS; i++) {
        printf("Total %s: %'lu\n", counterNames[i], counterTotals[i]);
    }
    // totalGpuTime is in milliseconds, hence the division by 1000.
    printf("%'lu primary rays generated in %f seconds, %'f per second\n", counterTotals[RAYGEN_COUNTER], totalGpuTime / 1000.0f, counterTotals[RAYGEN_COUNTER] / (totalGpuTime / 1000.0f));
    long totalRays = counterTotals[RAYGEN_COUNTER] + counterTotals[OCCLUSION_COUNTER] + counterTotals[REFLECTION_COUNTER] + counterTotals[LAMBERT_COUNTER] + counterTotals[REFRACTION_COUNTER];
    printf("%'ld total rays generated in %f seconds, %'f per second\n", totalRays, totalGpuTime / 1000.0f, totalRays / (totalGpuTime / 1000.0f));
    delete[] gpuCounters;
}

