Crash when applying instance transform in closest_hit

Hello.
I’m currently working on a depth camera simulator written in Optix. It provides a point cloud with coordinates either in World space or in Sensors space. In order to accelerate acceleration structure update when meshes are moved with rigid transforms, I switched form a simple geometrical acceleration structure (gas) for all meshes, to a two-level hierarchy acceleration structure (one gas and one instance per mesh with one global instance acceleration structure (ias).
When using identity matrices and meshes with vertices in world coordinates and NOT accessing the matrix in optix cuda kernel it works fine. But when providing meshes with vertices in Object coordinates and Object to World transforms in hos tbuild code, there is a crash in closest_hit when applying the matrix. The error message is: « an illegal memory access was encountered ».
Here is the cuda optix code (__compute__pointAttribute function is called by closest_hit entry function code).

__forceinline__ __device__ void getTransformObjectToWorld(float4* mW)
  {
    OptixTraversableHandle handle = optixGetTransformListHandle(0);

    const float4* tW = optixGetInstanceTransformFromHandle(handle);

    mW[0] = tW[0];
    mW[1] = tW[1];
    mW[2] = tW[2];
  }

  // Matrix3x4 * point. v.w == 1.0f
  __forceinline__ __device__ vec3f transformPoint(const float4* m, vec3f const& v)
  {
    vec3f r;

    r.x = m[0].x * v.x + m[0].y * v.y + m[0].z * v.z + m[0].w;
    r.y = m[1].x * v.x + m[1].y * v.y + m[1].z * v.z + m[1].w;
    r.z = m[2].x * v.x + m[2].y * v.y + m[2].z * v.z + m[2].w;

    return r;
  }

  // Matrix3x4 * vector. v.w == 0.0f
  __forceinline__ __device__ vec3f transformVector(const float4* m, vec3f const& v)
  {
    vec3f r;

    r.x = m[0].x * v.x + m[0].y * v.y + m[0].z * v.z;
    r.y = m[1].x * v.x + m[1].y * v.y + m[1].z * v.z;
    r.z = m[2].x * v.x + m[2].y * v.y + m[2].z * v.z;

    return r;
  }

  //------------------------------------------------------------------------------
  // closest hit programs
  //------------------------------------------------------------------------------
  
  __device__ void __compute__pointAttribute(
    const TriangleMeshSBTData& sbtData,
    const OptionalComputations& options,
    const IndexParams& indexParams,
    const cudaTextureObject_t& backgroundTexture,
    float dmin,
    float dmax,
    curandState* state,
    uint32_t rayIndex,
    const cudaTextureObject_t& noiseTexture,
    parRayData& prd)
  {
    // ------------------------------------------------------------------
   // gather some basic hit information
   // ------------------------------------------------------------------
    const unsigned int primID = optixGetPrimitiveIndex();
    unsigned int triangleID = (primID & indexParams.lowerMask);
    const float u = optixGetTriangleBarycentrics().x;
    const float v = optixGetTriangleBarycentrics().y;

    float4 objectToWorld[3];
    getTransformObjectToWorld(objectToWorld);

    // normal
    float dotrayDirNs;
    vec3f rayDir = optixGetWorldRayDirection();
    if (options.normals || options.reflectances || options.noise)
    {
      const vec3i indexNormal = sbtData.indexNormal[triangleID];
      vec3f NsO = ((1.f - u - v) * sbtData.normal[indexNormal.x]
        + u * sbtData.normal[indexNormal.y]
        + v * sbtData.normal[indexNormal.z]);
      vec3f Ns = transformVector(objectToWorld, NsO);

      dotrayDirNs = dot(rayDir, Ns);
      if (dotrayDirNs > 0.f)
      {
        Ns = -Ns; // face forward
      }
      else
      {
        dotrayDirNs = -dotrayDirNs;
      }
      if (options.normals)
      {
        Ns = normalize(Ns);
        prd.normalCoords = Ns;
      }
    }

    // colors
    if (options.colors)
    {
      vec4f interpolatedBackgroundImage = tex2D<float4>(backgroundTexture, prd.color.x, prd.color.y);
      prd.color = (vec3f)interpolatedBackgroundImage;
    }

    // reflectivity
    if (options.reflectances || options.noise)
    {
      const vec4f material = sbtData.material[triangleID];
      prd.reflectivity = material[0] * dotrayDirNs; // diffuse * dot prod (Lambert model)
    }

    const vec3i index = sbtData.index[triangleID];
    // intersection point
    vec3f surfPosO = (1.f - u - v) * sbtData.vertex[index.x]
      + u * sbtData.vertex[index.y]
      + v * sbtData.vertex[index.z];
    vec3f surfPos = transformPoint(objectToWorld, surfPosO);

    // possibly add noise to point position
    if (options.noise)
    {
      // compute relative distance
      const vec3f origin = optixGetWorldRayOrigin();
      float dist = length(surfPos - origin);
      float distRelative = (dist - dmin) / (dmax - dmin);
      float noiseAmplitude = tex2D<float>(noiseTexture, distRelative, prd.reflectivity);
      curandState localState = state[rayIndex];
      float normalizedNoise = curand_normal(&localState); // normal distribution
      /* Copy state back to global memory */
      state[rayIndex] = localState;
      rayDir = normalize(rayDir);
      surfPos = origin + (dist + noiseAmplitude * normalizedNoise) * rayDir; 
    }
    prd.coords = surfPos;
    prd.meshID = (int) ((primID & indexParams.upperMask) >> indexParams.shiftIndex);
  }

And the acceleration structure build code :

    OptixTraversableHandle ComputeGPU::buildAccel(
      const std::vector<std::shared_ptr<TriMesh>>& meshes, 
      const std::vector<Material>& materials)
    {
      PING;

      vertexBuffer.resize(meshes.size());
      indexBuffer.resize(meshes.size());
      normalBuffer.resize(meshes.size());
      indexNormalBuffer.resize(meshes.size());
      materialBuffer.resize(meshes.size());
      _d_gas.resize(meshes.size());

      OptixTraversableHandle asHandle{ 0 };

      // ==================================================================
      // triangle inputs
      // ==================================================================
      _triangleInput.resize(meshes.size());
      std::vector<CUdeviceptr> d_vertices(meshes.size());
      std::vector<CUdeviceptr> d_indices(meshes.size());
      std::vector<uint32_t> triangleInputFlags(meshes.size());

      for (int meshID = 0; meshID < meshes.size(); meshID++) {

        // upload the model to the device: the builder
        std::shared_ptr<TriMesh> mesh = meshes[meshID];
        vertexBuffer[meshID].alloc_and_upload(mesh->vertices());
        indexBuffer[meshID].alloc_and_upload(mesh->triangle);
        normalBuffer[meshID].alloc_and_upload(mesh->normals());
        indexNormalBuffer[meshID].alloc_and_upload(mesh->triangleNormal);
        std::vector<vec4f> meshMaterials(mesh->triangle.size());
        for (size_t i = 0; i < mesh->triangle.size(); ++i)
        {
          const Material& material = materials[mesh->materialId[i]];
          meshMaterials[i] = vec4f(material.diffuse, material.reflection, material.refraction, material.ior);
        }
        materialBuffer[meshID].alloc_and_upload(meshMaterials);

        _triangleInput[meshID] = {};
        _triangleInput[meshID].type
          = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;

        // create local variables, because we need a *pointer* to the
        // device pointers
        d_vertices[meshID] = vertexBuffer[meshID].d_pointer();
        d_indices[meshID] = indexBuffer[meshID].d_pointer();

        _triangleInput[meshID].triangleArray.vertexFormat = OPTIX_VERTEX_FORMAT_FLOAT3;
        _triangleInput[meshID].triangleArray.vertexStrideInBytes = sizeof(vec3f);
        _triangleInput[meshID].triangleArray.numVertices = (int)mesh->verticesSize();
        _triangleInput[meshID].triangleArray.vertexBuffers = &d_vertices[meshID];

        _triangleInput[meshID].triangleArray.indexFormat = OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
        _triangleInput[meshID].triangleArray.indexStrideInBytes = sizeof(vec3i);
        _triangleInput[meshID].triangleArray.numIndexTriplets = (int)mesh->triangle.size();
        _triangleInput[meshID].triangleArray.indexBuffer = d_indices[meshID];

        // LIMITATION : offset has 30 bits (not 32 ?)
        unsigned int offset = (unsigned int)meshID;
        _triangleInput[meshID].triangleArray.primitiveIndexOffset = (offset << _shiftIndex);

        triangleInputFlags[meshID] = 0;

        // in this example we have one SBT entry, and no per-primitive
        // materials:
        _triangleInput[meshID].triangleArray.flags = &triangleInputFlags[meshID];
        _triangleInput[meshID].triangleArray.numSbtRecords = 1;
        _triangleInput[meshID].triangleArray.sbtIndexOffsetBuffer = 0;
        _triangleInput[meshID].triangleArray.sbtIndexOffsetSizeInBytes = 0;
        _triangleInput[meshID].triangleArray.sbtIndexOffsetStrideInBytes = 0;

        // build accel structure for each mesh
        OptixAccelBuildOptions accelBuildOptions = {};

        accelBuildOptions.buildFlags = OPTIX_BUILD_FLAG_NONE;
        accelBuildOptions.operation = OPTIX_BUILD_OPERATION_BUILD;
        accelBuildOptions.motionOptions.numKeys = 1;

        OptixAccelBufferSizes accelBufferSizes;

        OPTIX_CHECK(optixAccelComputeMemoryUsage(optixContext, &accelBuildOptions, &_triangleInput[meshID], 1, &accelBufferSizes));

        CUDABuffer d_tmp;
        d_tmp.alloc(accelBufferSizes.tempSizeInBytes);
        CUDABuffer d_gas; // This holds the geometry acceleration structure.
        d_gas.alloc(accelBufferSizes.outputSizeInBytes);

        OptixTraversableHandle traversableHandle = 0; // This is the GAS handle which gets returned.

        OPTIX_CHECK(optixAccelBuild(optixContext,
          /* stream */0,
          &accelBuildOptions,
          &_triangleInput[meshID],
          1,
          d_tmp.d_pointer(),
          d_tmp.sizeInBytes,
          d_gas.d_pointer(),
          d_gas.sizeInBytes,
          &traversableHandle,
          nullptr,
          0
        ));
        CUDA_SYNC_CHECK();

        d_tmp.free();
        _d_gas[meshID] = d_gas; // keep it in order to free it at the end

        // instance
        OptixInstance instance = {};
        Eigen::Affine3f transform = meshes[meshID]->H_world * Eigen::Scaling(meshes[meshID]->scale);
        Eigen::Matrix4f m = transform.matrix();
        const float instanceTransformation[12] =
        {
          m(0,0), m(0,1), m(0,2), m(0,3),
          m(1,0), m(1,1), m(1,2), m(1,3),
          m(2,0), m(2,1), m(2,2), m(2,3)
        };
        unsigned int id = static_cast<unsigned int>(_instances.size()); // meshID

        memcpy(instance.transform, instanceTransformation, sizeof(float) * 12);
        instance.instanceId = id;
        instance.visibilityMask = 255;
        instance.sbtOffset = id * RAY_TYPE_COUNT; // This controls the SBT instance offset!
        instance.flags = OPTIX_INSTANCE_FLAG_NONE;
        instance.traversableHandle = traversableHandle;

        _instances.push_back(instance);  

      } // end loop en meshes


      CUDABuffer d_instances;
      d_instances.alloc_and_upload(_instances);

      OptixBuildInput instanceInput = {};

      instanceInput.type = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
      instanceInput.instanceArray.instances = d_instances.d_pointer();
      instanceInput.instanceArray.numInstances = (unsigned int)_instances.size();

      OptixAccelBuildOptions accelInstanceBuildOptions = {};

      accelInstanceBuildOptions.buildFlags = OPTIX_BUILD_FLAG_NONE; // OPTIX_BUILD_FLAG_ALLOW_UPDATE; 
      accelInstanceBuildOptions.operation = OPTIX_BUILD_OPERATION_BUILD;

      OptixAccelBufferSizes iasBufferSizes = {};
      OPTIX_CHECK(optixAccelComputeMemoryUsage(optixContext, &accelInstanceBuildOptions, &instanceInput, 1, &iasBufferSizes));
      _d_ias.alloc(iasBufferSizes.outputSizeInBytes);
      CUDABuffer d_tmp;
      d_tmp.alloc(iasBufferSizes.tempSizeInBytes);

      OPTIX_CHECK(optixAccelBuild(optixContext, 0/*m_cudaStream*/,
        &accelInstanceBuildOptions, &instanceInput, 1,
        d_tmp.d_pointer(), iasBufferSizes.tempSizeInBytes,
        _d_ias.d_pointer(), iasBufferSizes.outputSizeInBytes,
        &asHandle, nullptr, 0));
      CUDA_SYNC_CHECK();

      d_tmp.free();
      d_instances.free();

      return asHandle;
    }

Rmk : When replacing the line :

vec3f surfPos = transformPoint(objectToWorld, surfPosO);

with a simple

vec3f surfPos = surfPosO;

There is no crash (although the result is incorrect).

Rmk 2 : The initial code (a global gas) was inspired from the series of Optix 7 sample code in Siggraph 2019. And the adaptation to a two level hierarchy acceleration structure comes from Optix adavanced samples, more specifically from the intro_runtime application.

Thanks for your help… that’s several days I’m completly stuck.

If the code wasn’t using IAS before, have you changed the traversableGraphFlags from OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS to the now required OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING?

Otherwise optixGetTransformListHandle(0) can’t work. If optixGetTransformListSize() == 0, accessing the zeroth element would crash.

Have you calculated the OptiX pipeline stack space yourself?
I would recommend to always do that yourself. It’s strictly required when using callable programs in OptiX.
Search the forum for optixPipelineSetStackSize for explanations and example code.

Code comments

    // PERF No need to split these. Use a float2.
    const float u = optixGetTriangleBarycentrics().x; 
    const float v = optixGetTriangleBarycentrics().y;
...
      // BUG Normals are not handled like vectors. They must be transformed by the inverse transpose.
      // BUG The result is not normalized when there are scaling transforms.
      vec3f Ns = transformVector(objectToWorld, NsO); 
...
  // PERF The optixGetWorldRayDirection(); is usually normalized, or at least it's good style to only use normalized ray directions or intersection programs could work differently than you expected.
  rayDir = normalize(rayDir); 

The github links in this post show how I’m doing the transforms:
https://forums.developer.nvidia.com/t/understanding-optixtransformnormalfromobjecttoworldspace/285169/2

Thanks a lot for your help !
With traversableGraphFlags change it now works perfectly.
I also updated Optix pipeline stack space and corrected / improved the cuda optix kernel.
Thanks again.

I’m not sure if the OptiX validation mode catches that, but it’s a good idea to enable it during debugging with a logging callback.

Note that OptiX validation mode adds synchronizations and costs a lot of performance, so never benchmark or release applications with validation mode enabled!

Example code here: https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/MDL_renderer/src/Device.cpp#L296

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.