OptiX Raytracing Benchmark


Is there a way to get a benchmark of rays_cast/second rather than frames per second?

You mean by changing some command line parameter in the OptiX SDK examples? No.
In general? Yes.

I would recommend adding an unsigned int counter to the per-ray payload and incrementing it before every rtTrace call inside the device code. At the end of the ray generation program, add that counter to a 2D INPUT_OUTPUT unsigned int buffer which accumulates your rays per launch index. Initialize the values in that buffer to zero at the beginning of your benchmark.

If you let that run for some seconds and then sum all counters together on the host you should have a pretty accurate number of rays per time result. Make sure the time it takes to sum the counter on the host is not included in the benchmarked time.

Note that counting the rays will obviously shift the overall performance a little due to the additional work.

Thanks, I suppose I was hoping for an out-of-the-box solution to be lazy, haha.

Should do fine. :)


I’ve created a uint buffer which is being updated with the ray count, but I’m having trouble reading the uint value back from the OptiX buffer.
Does one of the examples in the SDK show this? If not, how can I access the values of the buffer?

That’s explained in OptiX 3.8.0 Programming Guide Chapter 3.2 Buffers

From my understanding, this chapter only explains the OptiX side of the code, not the code you would use in the host C++ program.

No, that chapter explains both the host- and device-side handling of buffers, only that it’s using the actual OptiX C-API.

If you don’t know how the C-API is wrapped by C++ objects and functions, simply read the optixpp_namespace.h header which implements that.
The documentation in that header can also be found inside the OptiX API Reference.

Working through the provided OptiX examples sample5 and sample5pp shows how the host code looks with the C-API and the C++ wrappers in an otherwise identical program.

Buffer mapping, reading/writing, and unmapping is one of the fundamental things in OptiX. Just do “Find in Files” for “rtBufferMap” or “->map()” on the OptiX SDK samples folder source code files and you’ll get plenty of examples.

Awesome, thanks for clearing that up. Finally managed to get it working.

Theoretically, at a 640x480 resolution, considering that there are no intersections (only counting primary rays), we should have 307200 rays being generated.

However, my output shows only ~1500 rays being shot in a single frame, in a scene without an object in it (all rays miss, so theoretically there should only be primary rays).

My program starts by incrementing the count for the primary ray (in the camera program), then twice per collision (once for reflection and once for refraction), so I should be accounting for each ray being shot into the scene.

Does OptiX use some other method for shooting rays into a scene which could account for the difference in primary rays I’m reading?

If you shoot 640x480 primary rays and all miss, your result buffer should contain “1” in every cell.
=> Inspect the mapped pointer in a debugger memory window (maybe at a smaller resolution like 64x48) and you should see.
If that’s not the case, you forgot to initialize some counter.

Post code or it didn’t happen. ;-)
The relevant pieces would be the counter buffer initialization, the counting inside the device code, the write-back or accumulation at the end of the ray generation program, and the evaluation of the sum inside the host code.

Haha, I’m using a single cell and incrementing that value.
Here is how I create the buffer and assign it:

Buffer b = m_context->createBuffer(RT_BUFFER_INPUT_OUTPUT, RT_FORMAT_INT, 1);
  int vals[] = {0};
  memcpy(b->map(), vals, sizeof(vals));
  b->unmap();
  buffers = static_cast<int*>(b->map());
  std::cout << buffers[0] << std::endl; // double check to see if we inserted the correct value, it comes out correct
  b->unmap();

Declaring the buffer in the OptiX code:

rtBuffer<int, 1> raysTracedThisFrame; // should allow for a buffer with room for one integer

How I’m counting the rays in my camera program.
This should only count primary rays, as this is the only point where I’m accumulating the value in my OptiX program.

raysTracedThisFrame[0] =  raysTracedThisFrame[0]+1;
rtTrace(top_object, ray, prd);
output_buffer_f4[launch_index] = prd.result;

And the following is in my trace method within the C++ code.
This is only called once per frame, so theoretically it should only count the rays generated in that frame.

int vals[] = {0}; // re-initialize to zero
  memcpy(b->map(), vals, sizeof(vals)); // copy the zero value into the buffer
  b->unmap();
  m_context->launch( 0, static_cast<unsigned int>(buffer_width),
		                  static_cast<unsigned int>(buffer_height) );
  buffers = static_cast<int*>(b->map()); // grab the resulting buffer
  std::cout << "Rays This Frame: " << buffers[0] << std::endl; // output the rays counted this frame
  b->unmap();

And the following is what I get as a result over several frames where there is no object present
(So it should only shoot primary rays, no secondary rays are generated)

Rays This Frame: 1618
Rays This Frame: 1683
Rays This Frame: 1579
Rays This Frame: 1642
Rays This Frame: 1595
Rays This Frame: 1624
Rays This Frame: 1543
Rays This Frame: 1591
Rays This Frame: 1611
Rays This Frame: 1664
Rays This Frame: 1664
Rays This Frame: 1553
Rays This Frame: 1557
Rays This Frame: 1622

I’ve verified that the memory is correctly allocated, and checked that I can read the correct value by having my camera program set the integer to 200; the output is correct when I do so, so I’m fairly sure I’m not missing any initialization.

Sorry for being a pain in the neck, and thanks for helping out.

I forgot a line; this is how I set the buffer in the OptiX program


Ok, that’s not going to work for any (buffer_width * buffer_height) > 1.

Think parallel!

If the launch size is bigger than that, all currently running threads would do this operation in parallel: raysTracedThisFrame[0] = raysTracedThisFrame[0] + 1;

That obviously incurs a read/write race condition among threads, which means the results are not deterministic.

It would be easily fixable by using atomicAdd(&raysTracedThisFrame[0], 1); instead of that line.
But DON’T, because that will slow down your whole rendering, skew your rays/second benchmark a lot, and wouldn’t work on multi-GPU.

That is why I recommended using a 2D counter buffer with the size of your launch dimension and writing the counted rays per launch index. That way everything keeps running in parallel, and the sum is just a simple for-loop on the host.

Ah, that makes perfect sense. I didn’t think of that; I completely forgot that OptiX runs in parallel.
I’ll give it another go later today and post back.

Thanks again for putting up with my ignorance XD

It all works now, thanks again!

Hello. I just went through this thread. I have been using atomicAdd() to count the number of rays in my simulation until now, and it’s been working perfectly. I just read that it would affect the performance.

If I go with Detlef’s suggestion, wouldn’t it take too much memory? I am using millions of rays, and I would have to use a for-loop on my host machine, which would obviously be slow.
I should mention that my method is iterative, so sending the buffer back and forth between host and device would be slow.

So, I just need an opinion: for my case, is atomicAdd() better, or the above suggestion?
Thank you.

Depends on what you want to measure.

If you just want to count the number of rays do whatever you like.

If you want to benchmark the number of rays per second exactly, the measurement itself must not affect the performance. That’s why counting the rays at runtime and summing up the final number in a separate step which is not part of the measurement will get you more accurate results.
How that works in an iterative renderer is described in my first post of this thread.
If you’re rendering an image all you need is another buffer with the size of the image, or whatever maximum launch size you use.

Even more accurate would be to use a deterministic rendering approach (same number of rays shot) and separate ray counting and benchmarking into distinct builds of the same application. Then runtime during benchmarking is maximized and you have the ray count to do your rays/second calculation.

Thank you for your detailed reply. For my case I think atomicAdd() is better, as I have to count the number of rays hitting an object on each iteration (2 million rays per iteration). If I wrote a count function on the host using a for-loop over the total number of rays, it would be slow.

Thank you

Yes, if your launch size does not match the result size, i.e. if you have scattered writes, counting in the destination resolution with atomicAdd() would be the appropriate solution. That’s a different case than the original topic of rays/second benchmarking.