OptiX compilation error in validation mode

I’m getting an error when compiling my Optix shader program in validation mode with Optix 8.0:

2024-01-23 18:53:29.411 (  12.581s) [GpuWorker-0     ]  OptixRenderDevice.cpp:18     ERR| COMPILE ERROR: Malformed input. See compile details for more information.
Error: Taking the address of functions is illegal because indirect function calls are illegal. Address of function  is taken and used in function __direct_callable__eval_complex_spectrum_texture (C:/Users/Thomas/dev/Rayscaper/src/cpp/kernelshared/textures/FixedSizeStack.h:14:61)

2024-01-23 18:53:29.411 (  12.582s) [GpuWorker-0     ]    OptixIntegrator.cpp:116   FATL| FATAL: C:\Users\Thomas\dev\Rayscaper\src\cpp\kernelgpu\OptixIntegrator.cpp:116 (optixModuleCreate(optix_device_context_, &module_compile_options, &pipeline_compile_options, (const char*)embedded_optix_code, optix_ir_code_len, log_buffer, &log_buffer_size, &optix_module_)) failed with result: Invalid input [OPTIX_ERROR_INVALID
_INPUT]

When I look at my code, FixedSizeStack.h:14:61 I don’t understan how I am taking the address of a function:

#ifndef RAYSCAPER_FIXEDSIZESTACK_H
#define RAYSCAPER_FIXEDSIZESTACK_H

#include <cstdint>

#include "GpuGlobals.h"

namespace kernelshared {

// Stack with a fixed number of elements, no memory allocations required to use this stack.
template <typename ElementType, uint8_t StackSize>
class FixedSizeStack {
 public:
  RS_CPU_GPU FixedSizeStack() : stack_pointer_(0) {} // <<< THIS GOES WRONG

  // Returns the stack size.
  RS_CPU_GPU uint8_t GetMaxSize() const;

  // Returns the number of elements on the stack.
  RS_CPU_GPU uint8_t GetSize() const;

  // Returns TRUE if this stack has no elements.
  RS_CPU_GPU bool IsEmpty() const;

  // Returns TRUE if this stack has reached its maximum capacity.
  RS_CPU_GPU bool IsFull() const;

  // Pushes an element on the top of the stack, returns FALSE if the stack was full, and we could not push the element.
  RS_CPU_GPU bool Push(ElementType element);

  // Pops the top element and returns it.
  RS_CPU_GPU ElementType Pop();

  // Peeks returns the top element of the stack.
  RS_CPU_GPU ElementType& Peek();

 private:
  // The actual elements on the stack.
  ElementType stack_[StackSize];
  // Pointer to the next free slot on the stack.
  uint8_t stack_pointer_ = 0;
};

template <typename ElementType, uint8_t StackSize>
inline uint8_t FixedSizeStack<ElementType, StackSize>::GetMaxSize() const {
  return StackSize;
}

template <typename ElementType, uint8_t StackSize>
inline uint8_t FixedSizeStack<ElementType, StackSize>::GetSize() const {
  return stack_pointer_;
}

template <typename ElementType, uint8_t StackSize>
inline bool FixedSizeStack<ElementType, StackSize>::IsEmpty() const {
  return stack_pointer_ == 0;
}

template <typename ElementType, uint8_t StackSize>
inline bool FixedSizeStack<ElementType, StackSize>::IsFull() const {
  return stack_pointer_ == StackSize;
}

template <typename ElementType, uint8_t StackSize>
inline bool FixedSizeStack<ElementType, StackSize>::Push(ElementType element) {
  if (IsFull()) {
    return false;
  }
  stack_[stack_pointer_++] = element;
  return true;
}

template <typename ElementType, uint8_t StackSize>
inline ElementType FixedSizeStack<ElementType, StackSize>::Pop() {
  ElementType top = stack_[--stack_pointer_];
  return top;
}

template <typename ElementType, uint8_t StackSize>
inline ElementType& FixedSizeStack<ElementType, StackSize>::Peek() {
  return stack_[stack_pointer_ - 1];
}

}  // namespace kernelshared

#endif  // RAYSCAPER_FIXEDSIZESTACK_H

The code compiles and works fine when I don’t use validation mode.

As a side note, this is how I do validation:

 OptixModuleCompileOptions module_compile_options{0};
  module_compile_options.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT;
#ifdef OPTIX_VALIDATION
  LOG_F(INFO, "Running OptiX with additional validation.");
  module_compile_options.optLevel = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
  module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;
#else
  module_compile_options.optLevel = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
  module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;
#endif

Welcome to the OptiX developer forum.

What is your system configuration?
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the module input code, host compiler version.

What is the definition of your RS_CPU_GPU macro?
It should be something like this to make sure that all device functions are really inlined. (In comments because the forum colors code weirdly otherwise).
Only OptiX entry points and callables should be actual calls.

// #if defined(__CUDACC__) || defined(__CUDABE__)
// #define RS_CPU_GPU __forceinline__ __host__ __device__
// #else
// #define RS_CPU_GPU inline
// #endif

Also you declare your functions with RS_CPU_GPU but don’t use that when defining the functions but only use inline which is just a hint to the compiler.

Does it work when changing these two things?

Setting module compile options to no optimizations and full debug info shouldn’t be necessary for the
OptixDeviceContextOptions validationMode OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL.
That will only result in slower code.

Do you translate the OptiX device code with debug information (nvcc option -G) in that case?
Because that will change quite some things in generated code an produces horrendously slow code, so I’m never doing that in my examples.

How does the code inside that __direct_callable__eval_complex_spectrum_texture() look like?

Are you using *.ptx or *.optixir module input code?

Did that change with different CUDA Toolkit versions?
Did that change with different display driver versions?
(Trying to isolate if this is CUDA or OptiX compilation issue.)

Do you need that FixedSizeStack constructor at all?
You default initialize the stack_pointer anyway with: uint8_t stack_pointer_ = 0;

It’s the only function defined inside the class, all others are only declared.
Does it change when you put the definition outside the class?

Your Pop() and Peek() functions are not checking if the stack IsEmpty() and will underflow the stack_pointer_ when being called on an empty stack. I assume your dealing with that inside the calling code.

Depending on what the ElementType is (like a struct bigger than float4) it would make sense to use const reference arguments in Push() to avoid copy. Also the local ElementType top inside Pop() isn’t required. Probably irrelevant when all functions get inlined anyway.

Hey Detlef,

Thanks for getting back to me so quickly. Here’s the missing system information:

  • OS version: Windows 10 (64-bit)
  • GPUs:
    • NVIDIA GeForce RTX 3060, compute-capability=8.6, total-memory=11GB
    • NVIDIA GeForce RTX 2060, compute-capability=7.5, total-memory=11GB
  • Optix version: 8.0.0
  • CUDA toolkit version: 12.1.105
  • Host compiler: MSVC 19.29.30146.0
  • Optix-IR: Yes

Here’s the CMake macro I am using to generate my code. I think it might come from the OptiX advanced samples, but I modified it to use a custom marker 0xfeedf00d to be able to embed optix-ir code in my binary:

macro(cuda_compile_and_embed is_debug output_var cuda_file)
    set(c_var_name ${output_var})
    set(nvcc_options "--use_fast_math --keep --relocatable-device-code=true --optix-ir --expt-relaxed-constexpr --std=c++17 --machine=64 --gpu-architecture=compute_50 -Wno-deprecated-gpu-targets --resource-usage")
    if (${is_debug})
        message(WARNING "Building Optix-IR code with debug enabled, this hurts performance.")
        # Use -G and --generate-line-info for debugging
        cuda_compile_ptx(ptx_files ${cuda_file} OPTIONS -G --generate-line-info ${nvcc_options})
    else ()
        message(STATUS "Building optimized Optix-IR")
        cuda_compile_ptx(ptx_files ${cuda_file} OPTIONS ${nvcc_options})
    endif ()
    list(GET ptx_files 0 ptx_file)
    set(embedded_file ${ptx_file}_embedded.c)
    message(STATUS "Adding rule to compile and embed ${cuda_file} to \"const char ${var_name}[];\". Debug is ${is_debug}")
    add_custom_command(
        OUTPUT ${embedded_file}
        COMMAND ${BIN2C} -c --padd 0xfe,0xed,0xf0,0x0d --type char --name ${c_var_name} ${ptx_file} > ${embedded_file}
        DEPENDS ${ptx_file}
        COMMENT "compiling (and embedding optix-ir from) ${cuda_file}"
    )
    set(${output_var} ${embedded_file})
endmacro()

Here is what my macro definitions look like - they are similar to the ones you posted, only difference is that I don’t have an inline on the RS_CPU_GPU macro when it’s compiled as host code:

// #if defined(CUDA_GPU_ENABLED) && defined(__CUDACC__)
// 
// // These macros become active when compiling CUDA code via nvcc. They should be used to embellish functions and methods
// // to make them available for use in the CUDA code.
// #define RS_CPU_GPU __forceinline__ __device__ __host__
// #define RS_GPU __forceinline__ __device__
//
// #else
// 
// // These macros become active when compiling regular host code with a regular compiler, i.e. not nvcc.
// #define RS_CPU_GPU
// #define RS_GPU
// 
// #endif

I put your suggestions in place:

  • Remove the constructor.
  • Prefix my functions with RS_CPU_GPU.

None of these worked, I am getting a similar error but not the same:

2024-01-24 10:33:20.388 ( 245.111s) [GpuWorker-1     ]  OptixRenderDevice.cpp:18     ERR| COMPILE ERROR: Malformed input. See compile details for more information.
Error: Taking the address of functions is illegal because indirect function calls are illegal. Address of function  is taken and used in function _ZN12kernelshared14FixedSizeStackIN6cgmath11RGBSpectrumELh32EEC1Ev (C:/Users/Thomas/dev/Rayscaper/src/cpp/kernelshared/textures/FixedSizeStack.h:12:7)

2024-01-24 10:33:20.389 ( 245.112s) [GpuWorker-1     ]    OptixIntegrator.cpp:116   FATL| FATAL: C:\Users\Thomas\dev\Rayscaper\src\cpp\kernelgpu\OptixIntegrator.cpp:116 (optixModuleCreate(optix_device_context_, &module_compile_options, &pipeline_compile_options, (const char*)embedded_optix_code, optix_ir_code_len, log_buffer, &log_buffer_size, &optix_module_)) failed with result: Invalid input [OPTIX_ERROR_INVALID
_INPUT]

The direct callable chain like this (WARN: wall of code):

extern "C" __device__ cgmath::Spectrum __direct_callable__eval_complex_spectrum_texture(
    const OptixSpectrumTexture& texture, const kernelshared::TextureEvalContext& ctx) {
  return ComplexSpectrumTextureEval(texture, ctx);
}

RS_CPU_GPU cgmath::Spectrum ComplexSpectrumTextureEval(const OptixSpectrumTexture& texture,
                                                       const kernelshared::TextureEvalContext& ctx) {
  return ComplexTexEval<cgmath::Spectrum>(texture, ctx, kOptixLaunchParams.spectrum_textures_);
}

// Implementation of evaluation of complex textures on the GPU. Our definition of a complex texture is any texture
// that can recursively contain other textures. For example the checkerboard texture.
//
// Recursive function calls aren't allowed on the GPU, hence we evaluate textures using fixed size stacks. First, we
// have an input stack, where we push all textures that still require evaluation. After evaluation, the texture writes
// its values to the output stack, textures can also consume the values from their children from the output stack.
//
// A limitation of this setup is that we can't mix texture types. For example, we cannot evaluate a float texture
// required to evaluate a Spectrum texture. This mainly cripples the usefulness of the mix texture node.
//
// We took a little inspiration from Cycles' SVM (Shader Virtual Machine) although our textures are a lot less complex
// to evaluate than some of Blender's textures.
template <typename TexelType>
RS_CPU_GPU inline TexelType ComplexTexEval(const OptixTexture<TexelType>& texture,
                                           const kernelshared::TextureEvalContext& ctx,
                                           rstd::span<const OptixTexture<TexelType>> textures) {
  using namespace kernelshared;

  struct StackItem {
    const OptixTexture<TexelType>* texture_;
    bool pushed_children_;
  };
  FixedSizeStack<StackItem, 32> input_stack;
  FixedSizeStack<TexelType, 32> output_stack;

  input_stack.Push({&texture, false});

  while (!input_stack.IsEmpty()) {
    const OptixTexture<TexelType>* current_texture = input_stack.Peek().texture_;
    const bool pushed_children = input_stack.Peek().pushed_children_;
    input_stack.Pop();

    switch (current_texture->GetType()) {
      case OptixTextureType::kCheckers2d: {
        if (pushed_children) {
          const TexelType odd = output_stack.Pop();
          const TexelType even = output_stack.Pop();
          const cgmath::Point2d st = current_texture->checkers_texture_.tex_mapping_.Map(ctx);
          output_stack.Push(EvalCheckers(st, even, odd));
        } else {
          input_stack.Push({current_texture, true});
          input_stack.Push({&textures[current_texture->checkers_texture_.odd_texture_index_], false});
          input_stack.Push({&textures[current_texture->checkers_texture_.even_texture_index_], false});
        }
        break;
      }
      case OptixTextureType::kConstant: {
        output_stack.Push(current_texture->const_texture_.value_);
        break;
      }
      case OptixTextureType::kImage: {
        auto image_coords = kernelshared::CalcImageCoordinate(current_texture->image_texture_.image_.GetImageSize(),
                                                              ctx,
                                                              current_texture->image_texture_.tex_mapping_,
                                                              current_texture->image_texture_.wrap_mode_);
        if (!image_coords.has_value()) {
          output_stack.Push(TexelType{});
        } else {
          output_stack.Push(kernelshared::TransformTexel<TexelType>(
              current_texture->image_texture_.image_.template GetValue<TexelType>(image_coords->x, image_coords->y),
              current_texture->image_texture_.inverse_gamma_correct_,
              current_texture->image_texture_.invert_));
        }
        break;
      }
      case OptixTextureType::kMix: {
        if (pushed_children) {
          const TexelType input_a = output_stack.Pop();
          const TexelType input_b = output_stack.Pop();
          float blend_amount = current_texture->mix_texture_.blend_amount_;
          output_stack.Push(cgmath::Lerp<TexelType>(blend_amount, input_a, input_b));
        } else {
          input_stack.Push({current_texture, true});
          input_stack.Push({&textures[current_texture->mix_texture_.input_a_texture_index_], false});
          input_stack.Push({&textures[current_texture->mix_texture_.input_b_texture_index_], false});
        }
        break;
      }
      case OptixTextureType::kNoise: {
        if (pushed_children) {
          const TexelType input_a = output_stack.Pop();
          const TexelType input_b = output_stack.Pop();
          const cgmath::Point3d uvw = current_texture->noise_texture_.texture_mapping_.Map(ctx);
          output_stack.Push(EvalNoise<TexelType>(current_texture->noise_texture_.noise_type_,
                                                 uvw,
                                                 current_texture->noise_texture_.omega_,
                                                 current_texture->noise_texture_.num_octaves_,
                                                 input_a,
                                                 input_b));
        } else {
          input_stack.Push({current_texture, true});
          input_stack.Push({&textures[current_texture->noise_texture_.input_b_texture_index_], false});
          input_stack.Push({&textures[current_texture->noise_texture_.input_a_texture_index_], false});
        }
        break;
      }
      case OptixTextureType::kNull: {
        output_stack.Push(GetFallbackValue<TexelType>());
        break;
      }
      case OptixTextureType::kUvTexture: {
        cgmath::Point2d st = current_texture->uv_texture_.tex_mapping_.Map(ctx);
        output_stack.Push(EvalUvTexture<TexelType>(st));
        break;
      }
      default: {
        // Unknown texture type -> bail out.
        return GetFallbackValue<TexelType>();
      }
    }

    // Bail out on a stack overflow.
    if (input_stack.IsFull() || output_stack.IsFull()) {
      return GetFallbackValue<TexelType>();
    }
  }

  return output_stack.Peek();
}

Also attached the Spectrum class implementation for completeness:
spectrum.txt (7.7 KB)

I will follow up to see if changing my driver or CUDA toolkit versions helps.

Regards,
Thomas

You must not use bin2c with --padd to embed OptiX-IR code! That breaks the binary input.
Unless you subtracted these four bytes from the module’s input size argument again.
https://forums.developer.nvidia.com/t/embedding-optix-ir/273199

There are two things to try:
1.) Does the OptiX IR code work with validation when not using the device code debug flag -G.

OptiX validation will complain when using

module_compile_options.optLevel = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;

without debug information. Just use the release mode settings for both:

  module_compile_options.optLevel = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
  module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

2.) Does it work when using PTX code instead of OptiX IR code?

(Remove the --padd when all modules go into a different embedded uchar arrays anyway.)

There is this known issue which is still being worked on which might be related here:
https://forums.developer.nvidia.com/t/debuggable-optix-ir-makes-launching-a-pipeline-failed/274765

Experiment 1

When dropping the debug flag (-G) and using release mode settings:

  module_compile_options.optLevel = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
  module_compile_options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

That compiles fine and renders fine. These are my normal code settings. I’m not encountering issues with the code when using normal release mode.

Experiment 2

Use PTX code instead of optix-ir. Then I’m running into the same problem.

2024-01-24 14:58:19.037 ( 319.171s) [GpuWorker-1     ]  OptixRenderDevice.cpp:18     ERR| COMPILE ERROR: Malformed input. See compile details for more information.                            
Warning: Requested debug level "OPTIX_COMPILE_DEBUG_LEVEL_FULL", but input module does not include full debug information.                                                                     
Error: Found an indirect function call in _ZN12kernelshared14FixedSizeStackIN6cgmath11RGBSpectrumELh32EEC1Ev: C:/Users/Thomas/dev/Rayscaper/src/cpp/kernelshared/textures/FixedSizeStack.h:12:7

2024-01-24 14:58:19.037 ( 319.171s) [GpuWorker-1     ]    OptixIntegrator.cpp:116   FATL| FATAL: C:\Users\Thomas\dev\Rayscaper\src\cpp\kernelgpu\OptixIntegrator.cpp:116 (optixModuleCreate(optix_device_context_, &module_compile_options, &pipeline_compile_options, (const char*)embedded_optix_code, optix_ir_code_len, log_buffer, &log_buffer_size, &optix_module_)) failed with result: Invalid input [OPTIX_ERROR_INVALID
_INPUT]

OK, if that only happens in debug mode that could be related to the other problem where debug code and validation don’t work together.

Please make sure you change all function definitions to use the RS_CPU_GPU define and not only inline. That also happens in your spectrum definition. Also things like this shouldn’t be inside device code: std::string ToString() const;

If cleaning up all places doesn’t help, we would need a minimal and complete reproducer in failing state to make sure this gets fixed as well.
If that is the same or a similar issue inside OptiX, please use release device code for both release and debug targets until there is a fix available.

Food for thought on your StackItem definition:
OptiX performance and pipeline stack size depends on the rendering algorithm (number of recursions, traversal depth) and the local memory used by the called program domains. Latter becomes even more important with the Shader Execution Reordering feature on Ada generation GPUs.
This structure definition is not optimal for local memory usage:

 struct StackItem {
    const OptixTexture<TexelType>* texture_;
    bool pushed_children_;
  };

In CUDA that is a 64-bit pointer which must be 8-byte aligned and an uchar for the bool.
When putting that into an array, that will result in 7 bytes of padding after the bool member per stack element to make the next array element properly aligned again.
It would be better for local memory usage if you split these in into two different stacks with just the pointers and just the bools and take care to push and pop them in parallel.

1 Like

Please make sure you change all function definitions to use the RS_CPU_GPU define and not only inline. That also happens in your spectrum definition. Also things like this shouldn’t be inside device code: std::string ToString() const;

Yup, that makes sense. I’m updating my code.

I’m confused regarding the std::string ToString() const definition. This is host code (it’s not annotated with RS_CPU_GPU) so wouldn’t nvcc just ignore this and send that to the host compiler?

If cleaning up all places doesn’t help, we would need a minimal and complete reproducer in failing state to make sure this gets fixed as well. If that is the same or a similar issue inside OptiX, please use release device code for both release and debug targets until there is a fix available.

Will follow-up to see if this fixes things.

Food for thought on your StackItem definition:
OptiX performance and pipeline stack size depends on the rendering algorithm (number of recursions, traversal depth) and the local memory used by the called program domains. Latter becomes even more important with the Shader Execution Reordering feature on Ada generation GPUs.
This structure definition is not optimal for local memory usage:

Thanks for the tip! Agreed, it makes sense to refactor my code like this.

Regards,
Thomas

I was just grasping at straws with what could be responsible for these indirect function calls resp. the initial case with a pointer to a function.
I have not seen that problem myself but don’t mix device and host member functions in my classes which are also used on the device.

The issue might also be something else than the error inside the link to the other forum thread because the symptoms are different. That other debug with validation issue is still being worked on.
If cleaning up all function definitions doesn’t solve this, we would still require a minimal and complete reproducer in failing state to analyze the root cause of these cases.