Instance pointer build input

Hi,

I am having trouble building an AS using an instance build input that uses pointers to OptixInstance structures (so with OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS). Synchronizing the device after calling optixAccelBuild gives me an illegal memory access error. I know the array of pointers itself must be on the device, as well as the OptixInstance structures the pointers point to.

The OptixInstances are part of a larger struct S, and I have an array of these structs S in device memory. On the device, I get a pointer to the instance variable of the structs S. I fill the build array with these pointers. Perhaps something goes wrong with memory alignment, since the pointers need to be multiples of OPTIX_INSTANCE_BYTE_ALIGNMENT. To ensure this, I placed the instance variable to be the first variable in S, and aligned the array of structs S with the macro from this post. Is this the correct way to ensure this alignment?

I have tried finding an example online to compare my code to, but can’t seem to find any. Are there available examples for building ASes from instance build inputs that use pointers?

Kind regards,

Nol

Except for internal unit tests, I haven’t found a public example using OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS either.

This API reference explains the alignment requirements for OptixInstances and the arrays inside the build input:
https://raytracing-docs.nvidia.com/optix7/api/html/struct_optix_build_input_instance_array.html

CUdeviceptr OptixBuildInputInstanceArray::instances

If OptixBuildInput::type is OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS instances and aabbs should be interpreted as arrays of pointers instead of arrays of structs.
This pointer must be a multiple of OPTIX_INSTANCE_BYTE_ALIGNMENT if OptixBuildInput::type is OPTIX_BUILD_INPUT_TYPE_INSTANCES.
The array elements must be a multiple of OPTIX_INSTANCE_BYTE_ALIGNMENT if OptixBuildInput::type is OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS.

That is pretty clear about the alignment requirements. (With OPTIX_INSTANCE_BYTE_ALIGNMENT == 16ull):

When using an array of OptixInstance structs, the device pointer to the array needs to be 16-byte aligned.
Since the OptixInstance struct is manually padded to a size of 80 bytes, all OptixInstance elements in that array are then 16-byte aligned as well.

If you’re using an array of pointers to OptixInstances, then each pointer in that array must point to a 16 byte aligned device address because the OptixInstance needs to be 16 byte aligned.

A CUdeviceptr itself is 64 bit and needs to be 8-byte aligned.

Alignment of the build input instance or instance pointer arrays themselves shouldn't be a problem when allocating the memory with cudaMalloc() or cuMemAlloc(), which return allocations that are at least 256-byte aligned.

So in your case you first need to make sure that the individual pointers to the OptixInstances are all aligned to 16 bytes.
Just add an assert((device_pointer & 15ull) == 0) to all your individual OptixInstance pointers in your build input array.
If that fires inside the debugger, you need to place the OptixInstance field in your own structures at a properly aligned offset and potentially pad your structure’s size.

Since OptixInstance itself doesn't carry an __align__(OPTIX_INSTANCE_BYTE_ALIGNMENT) specifier (which I think should have been added inside the OptiX SDK), it might have been placed at a misaligned offset in your structure for the first or later elements.
You can add that __align__ yourself to let the compiler automatically place it at an aligned offset in your own structures, but beware of the additional padding this introduces inside the struct.
There are many examples inside the OptiX SDK examples which use that for the Shader Binding Table record structures.

My approach for device side structures is to order their fields by CUDA alignment restrictions from big to small and pad them manually to the largest alignment needed in a struct.
https://forums.developer.nvidia.com/t/preferred-alignment-for-buffers/107532/2
https://forums.developer.nvidia.com/t/optic-7-passing-multiple-ray-data-to-closesthit-program/160005/4
The compilers will normally handle the alignment for built-in types, but this also makes sure there is no inadvertent padding added between fields inside the structure to make them as small as possible.

Hi, thanks for your answer. To investigate further, I have refactored my code so that the OptixInstance structs are no longer part of another struct. I now have an array of OptixInstance structs (d_instances) and an array of pointers to OptixInstance structs (d_instance_pointers), both of which are allocated with separate cudaMalloc calls. My program generates data in d_instances. I then have a small CUDA kernel which fills the array of pointers as follows:

d_instance_pointers[thread_id] = &d_instances[thread_id];

Of course, a check is in place that ensures thread_id does not exceed the size of the array. If I have read your answer correctly, this setup should not give any alignment issues (separate allocation calls, separate arrays). But the same error remains when I use OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS and d_instance_pointers.

Interestingly, no error is thrown when I call optixAccelBuild with OPTIX_BUILD_INPUT_TYPE_INSTANCES and d_instances.

Do you know what else might be going wrong here?

The allocation calls are:

CUdeviceptr d_instance_pointers = 0;
cudaMalloc(reinterpret_cast<void **>(&d_instance_pointers), count * sizeof(CUdeviceptr));

and

CUdeviceptr d_instances = 0;
cudaMalloc(reinterpret_cast<void **>(&d_instances), count * sizeof(OptixInstance));

I reinterpret_cast the device pointers to OptixInstance ** and OptixInstance * respectively to pass them to the OptiX program and the CUDA kernel.

Hi Nol,

I just tried switching a sample that uses instances to use instance pointers. It ran okay. That might be something you can do to compare against your own code, try converting an OptiX SDK sample like optixCutouts or optixHair to use instance pointers.

One thing I did differently is I constructed the instance pointers on the host using pointer arithmetic rather than making a CUDA kernel, like so:

    CUdeviceptr  d_instance_pointers;
    std::vector<CUdeviceptr> optix_instance_pointers;
    for( int i = 0; i < numInstances; i++ )
        optix_instance_pointers.push_back( d_instances + i * sizeof( OptixInstance ) );
    createOnDevice(optix_instance_pointers, &d_instance_pointers);

This lazy approach makes it a tiny bit easier to double-check the alignment of each pointer, on the host.

A few other things you can try:

  • Make sure to also call cudaDeviceSynchronize() before optixAccelBuild in case there was a pending error from before the launch.
  • Try turning on OptiX validation mode. (see OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL)
  • Check to make sure all relevant device buffers are copied to the GPU before optixAccelBuild runs; it’s easy sometimes to forget or to have the copy code running at a different time than you think.
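For the validation mode suggestion, a hedged sketch of how it is enabled at context creation (assuming OptiX 7.2+, where OptixDeviceContextOptions::validationMode exists; contextLogCallback and OPTIX_CHECK are placeholder names in the style of the SDK samples):

```cpp
// Enable OptiX validation mode when creating the device context.
OptixDeviceContextOptions options = {};
options.logCallbackFunction = &contextLogCallback;  // your log callback
options.logCallbackLevel    = 4;                    // print up to status/progress messages
options.validationMode      = OPTIX_DEVICE_CONTEXT_VALIDATION_MODE_ALL;

OptixDeviceContext context = nullptr;
OPTIX_CHECK( optixDeviceContextCreate( cuCtx, &options, &context ) );
```

Validation mode adds synchronization and extra checks, so it is slow, but it often turns a vague "illegal memory access" into a descriptive error message.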

–
David.

Hi David,

I tried converting the OptiX SDK sample optixHair. I replaced lines 221-223 which are

buildInput.type                       = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
buildInput.instanceArray.instances    = deviceInstances;
buildInput.instanceArray.numInstances = static_cast<unsigned int>( instances.size() );

by

CUdeviceptr  d_instance_pointers;
std::vector<CUdeviceptr> optix_instance_pointers;
for( int i = 0; i < instances.size(); i++ )
    optix_instance_pointers.push_back( deviceInstances + i * sizeof( OptixInstance ) );
createOnDevice(optix_instance_pointers, &d_instance_pointers);

buildInput.type                       = OPTIX_BUILD_INPUT_TYPE_INSTANCE_POINTERS;
buildInput.instanceArray.instances    = d_instance_pointers;
buildInput.instanceArray.numInstances = static_cast<unsigned int>( instances.size() );

running this results in:

CUDA call (cudaFree( reinterpret_cast<void*>( m_buffer ) ) ) failed with error: 'unspecified launch failure' (D:\OptiX SDK 7.3.0-copy\SDK\optixHair\Head.cpp:197)

Without the change, the code runs just fine and does not throw this error. I checked the alignment of the pointers with

assert(((deviceInstances + i * sizeof(OptixInstance)) & 15ull) == 0);

which succeeds without a problem (which it should, since deviceInstances is an array of OptixInstance structs and is created with a single cudaMalloc call).

Could you please provide the following system configuration information so we can investigate further:
OS version, installed GPU(s), VRAM amount, display driver version, OptiX (major.minor.micro) version, CUDA toolkit version (major.minor) used to generate the input PTX, host compiler version.
(These are all mandatory for any OptiX problem reports.)

That the launch failure is only caught in the Head class destructor is much too late if there was a launch failure in the optixAccelBuild or optixLaunch calls during rendering. There are many CUDA_CHECK resp. CUDA_SYNC_CHECK calls before that.

So what actually happens when running this inside the debugger? Is it rendering and only fails on exit?

Hi, thanks for looking into this. Here is my system configuration information:

OS version:

Edition Windows 10 Education
Version 21H1
OS build 19043.1110
Experience Windows Feature Experience Pack 120.2212.3530.0

Installed GPU:
NVIDIA GeForce RTX 2070

VRAM amount:
8192 MB

Display driver version:
Game Ready Driver Version 471.41

OptiX version:
7.3.0

CUDA toolkit version:
11.1

Host compiler version:
MSVC 19.29.30037.0

I assume you mean the host debugger (I do not know how to debug the device code for this). The error I stated in my previous reply is what happens when I run it inside the debugger (sorry, should have specified that). The rendering window is not opened at all, only some information is printed in the console. Running without the debugger gives the very same result (no window opens and that error is reported), both on a debug and a release build.

Thanks!
David is still investigating.

In the meantime, as a workaround I would recommend using the OptixInstance arrays directly.
Since that needs to be in device memory anyway, an additional array of pointers to that will just use more device memory and add an unnecessary indirection.
The only time this is useful is when not packing the OptixInstances into their own build input arrays, or when having a huge number of OptixInstance elements of which only a changing subset should be used for repeated/different instance acceleration structure builds.