Passing buffers from an RT_PROGRAM to a __device__ function

The following doesn’t work (Expected direct reference to global variable):

/// file: spd.h

// Interpolate SPD at wavelength lambda
__forceinline__ __device__ auto spd_value_at(optix::buffer<float> spd, float lambda) -> float {
    // do a linear interpolation on the spd...

/// file:
rtBuffer<float> spd;

RT_PROGRAM void closest_hit() {
    float value = spd_value_at(spd, 440.0f);

From reading the docs, it seems like the way to make this work is to use a rtBufferId, presumably creating the buffer, getting its Id, then creating a single-element buffer to feed the Id to the closest_hit program (or using rtVariableSetUserData), which can then pass the Id on to the spd_value_at function.

Is that right?

It also sounds like using buffer ids precludes some optimizations optix might otherwise do. How severe is that? I ask as in my renderer users can declare arbitrary values to pass to a shader, and I don’t necessarily know exactly how they’ll be used, which means I probably need the flexibility that passing any buffers using buffer ids would provide.

As a followup, trying to modify that function to take a buffer id like so:

__forceinline__ __device__ auto spd_value_at(optix::bufferId<float, 1> spd, float lambda) -> float {
    // ...snip ...
    return (1.0f - t) * spd[i0] + t * spd[i1];

errors when compiled using nvrtc (I’m using nvrtc directly rather than rtuCompile… mainly because I didn’t know the rtu functions existed until recently). The exact same code compiles and runs fine when precompiled with nvcc.

/Users/anders/packages/optix/5.0.0/include/optix_device.h(468): error: identifier \"make_index\" is undefined
          detected during instantiation of \"T &optix::bufferId<T, Dim>::operator[](optix::bufferId<T, Dim>::IndexType) const [with T=float, Dim=1]\" 
/Users/anders/code/rama/rama/src/spd.cuh(12): here

/Users/anders/packages/optix/5.0.0/include/optix_device.h(469): error: identifier \"create\" is undefined
          detected during instantiation of \"T &optix::bufferId<T, Dim>::operator[](optix::bufferId<T, Dim>::IndexType) const [with T=float, Dim=1]\" 
/Users/anders/code/rama/rama/src/spd.cuh(12): here

I can’t say why NVRTC complains about make_index() in your case.
It’s implemented inline inside the struct buffer template and used inside the buffer’s operator implementation, all inside optix_device.h.

Hmm, it’s defined inline instead of forceinline. inline is just a hint.
Would be worth a try what happens when changing the four make_index() functions to forceinline.

Since NVRTC found that header, I assume you provided the OptiX and CUDA headers as additional include directories inside the NVRTC options.
Mind that NVRTC doesn’t support all compile options NVCC supports.
Here’s a thread with how I’m using it:

If you’re using optixpp_namespace.h then I would use the rtBufferId defined instead of optix::bufferId for better code style. Since you say this is generally working with NVCC, it should also with the rtBufferId.

I also use this RT_FUNCTION define to make device code look more consistent to the existing RT_PROGRAM and RT_CALLABLE_PROGRAM defines:

For completeness, (not affecting the make_index() usage in the buffer’s operator implementation):
At its core the rtBufferId is only an integer variable. Means there are other ways to structure the code by using just an int with the Buffer’s ID as function argument, which you cast to the necessary buffer type.

RT_FUNCTION auto spd_value_at(int spd, float lambda) -> float {
  // ...
  return (1.0f - t) * rtBufferId<float, 1>(spd)[i0] + t * rtBufferId<float, 1>(spd)[i1];
// Or:
RT_FUNCTION auto spd_value_at(int spd, float lambda) -> float {
  // ...
  rtBufferId<float, 1> buf = rtBufferId<float, 1>(spd);
  return (1.0f - t) * buf[i0] + t * buf[i1];

Just don’t use the rtu functions. They are from times where NVRTC wasn’t available or working correctly and still required a CUDA toolkit to be installed on the target machine.
NVRTC is the better way to compile device code at runtime. CUDA 8.0 was the first version where it worked for OptiX. It’s also about three times faster than doing the same with NVCC which is nice when, for example, generating material shaders at runtime like some of my renderers do.

BTW, if you’re still on OptiX 5.0.0, I would seriously recommend to update to OptiX 5.1.x or OptiX 6.0.0.

I’m still clinging desperately to my 2013 rMBP for most of my development and I think I saw on another thread that Kepler doesn’t work properly on 6.0?

EDIT: Checking release notes I see that Maxwell is the minimum supported generation for 6.0. Guess it’s time to buy that RTX laptop after all…

My compile options when using nvrtc are:

let mut compile_options = vec![

and when using nvcc:

	-I${OPTIX_ROOT}/include \
	-I${CUDA_ROOT}/include \
	-I. \
	--machine=64 \
	--ptx \
	-arch=sm_30 \
	--relocatable-device-code=true \

I can confirm changing make_index’s inline to forceinline makes no difference, and I get the exact same errors trying to use nvrtc from CUDA 9.0 with Optix 5.1 on Ubuntu

Changing optix::buffer<float, 1> to rtBufferId<float, 1> in the definition of spd_value_at gives:

"/Users/anders/code/rama/rama/src/spd.cuh(5): error: attribute \"__device__\" does not apply here

What’s the best practice these days around when to compile more programs and when to use callable programs or switch statements?

Your not setting –use_fast_math?
That will result in much slower code for trigonometric functions and reciprocals.
It’s recommended to always set --use_fast_math in OptiX device code.
Look for approx in your PTX code before and after.

Correct. If you’re working on Kepler just don’t use OptiX 5.0.0 anymore but at least update to an OptiX 5.1.x version to benefit from all fixes which went into the releases.

Depends on what you’re doing. If all your programs are static, just compile them in your project with NVCC.
If you’re compiling shaders at runtime that’s easier and faster to do with NVRTC.
I’m doing both. The core programs are built by the project, the bindless callable programs for my material hierarchy are built with NVRTC at runtime, depending on the scene contents.
Though that also requires the OptiX, CUDA and my renderer’s headers on the target machine.

My renderer architecture looks basically like the block diagram at the end of this page:
When having runtime generated bindless callable programs for each unique material shader, the closest hit program does more calls than in those examples, but the fixed function code for BSDF sampling and evaluation and light sampling is just the same structure, I just have many more of them in my MDL capable renderer.
This is the most flexible and smallest OptiX kernel for that amount of features I could come up with.
This approach allows all kinds of light transport algorithms, because I can sample and evaluate BSDFs and lights not only inside the closest hit program domain.

EDIT: Related discussion:

The previous mindset of renderer architectures was to avoid tracing rays at all costs because that was the most expensive operation. Now that ray traversal and triangle intersection is hardware accelerated, that is no longer true. The recommendation is always to make the shading as efficient as possible, which is more important now than in the past.

The OptiX SDK CMakeLists.txt containss this option to switch compilers:

# Select whether to use NVRTC or NVCC to generate PTX
set(CUDA_NVRTC_ENABLED ON CACHE BOOL "Use NVRTC to compile PTX at run-time instead of NVCC at build-time")

I have not used that option: -std=c++14. I would have written your function as

RT_FUNCTION float spd_value_at(int spd, float lambda)
  // ...
  const float f0 = rtBufferId<float, 1>(spd)[i0]; // When possible spread these two memory reads out.
  const float f1 = rtBufferId<float, 1>(spd)[i1];
  return optix::lerp(f0, f1, t); // This uses one multiplication less than your code.

This is all guesswork. I would need to try myself what happens when using buffer IDs as parameters.

Instead of buffers have you considered using 1D textures (via bindless texture IDs) with linear filtering and wrap mode clamp_to_edge for that data? The linear filtering is for free then.

I haven’t gotten far enough for it to cause a noticeable speed difference yet :) enabling it did however cause severe artefacts (circular-patterned acne) on a 400-ish-unit high cornell box ( using the triangle intersection code from the SDK examples. I plan to return to that and try and increase the accuracy of the epsilon calculations (and implement same-prim skipping), but for now I’m ok without it.

Hadn’t thought of that, thanks!

Thanks for the link to the other discussion about shader graphs as well.

Incidentally, nvrtc also doesn’t seem to like hex floating point constants like 0x1p-32f

The solution to make_index not being found was indeed the std=c++14 compiler option. Some weird C++ name resolution thing perhaps?

From your observations I would simply suspect that NVCC and NVRTC are not at the same implementation level, with NVRTC being behind.
I don’t know if that has changed with newer CUDA versions, but you’re stuck at CUDA 9.0 with OptiX 5.1.x.

We’ve encountered the same issue with NVRTC in another project now and the solution was to change the buffer operator implementation inside the optix_device.h header to call make_index(i) and create() via an explicit this-> pointer. Then NVRTC should also compile it with the -std=c++14 option.

__device__ __forceinline__ T& operator[](IndexType i) {
      size_t4 c = this->make_index(i);
      return *(T*) this->create(type<T>(), rt_buffer_get(this, Dim, sizeof(T), c.x, c.y, c.z, c.w));