InvalidAddressSpace when using pointer from Continuation Callable parameters

I’m new to Optix and I’m trying to write an renderer which use Continuation Callable to recursively trace an ray. To record the trace process, I pass an pointer to the callable like this: (and this pointer should go throughout the whole process)

DR::SamplePRD *prd = getPRD();
optixContinuationCall<void, DR::SamplePRD *, int>(mat_id, prd, mat_instance_id);

and the callable function definition is:

PREFIX void __continuation_callable__NAME(DR::SamplePRD *prd, int instance_id)

I then tried to access the memory the prd variable points to like this:

if (prd->log) {DO_SOMETHING}

but I got CUDA Error cudaErrorInvalidAddressSpace (operation not supported on global/shared address space)., telling me that this is not the correct way to do the thing.

I then print out the value of pointer using printf("%u", prd), and they do points to the same place.

Is this a bug or am I doing something wrong?

My Environment:

  • Optix 7.3
  • CUDA 11.3
  • Windows 10 21H1
  • RTX 3060 Laptop

That would require some more information, or a minimal and complete reproducer with the failing device code.
Not sure what is going on there, like what getPRD() does, what PREFIX is, or (guessing) if that has anything to do with namespaces, etc.

The CUDA runtime API docs say:
cudaErrorInvalidAddressSpace = 717
While executing a kernel, the device encountered an instruction which can only operate on memory locations in certain address spaces (global, shared, or local), but was supplied a memory address not belonging to an allowed address space. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

So the main question is, to what memory type points your getPRD() result?

I’m using a pointer to my PRD structure which is allocated inside the ray generation program as local struct and move a pointer to that around between the non-callable program domains via two of the optixTrace payload registers and then use the pointer to that inside a direct callable for read/write here for example:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/bxdf_diffuse.cu#L67
This is called from the closet hit program like this:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/closesthit.cu#L245
That should work with continuation callables just that same.

I would generally recommend avoiding continuation callables if possible, simply for performance reasons.
While using callables helps reducing the overall device code, it’s usually not resulting in the fastest runtime.
The performance order is like this: inlined code > direct callable > continuation callable.

When using callables I normally calculate data inside direct callables and trace the ray with that data in an optixTrace afterwards.

You should also prefer iterative over recursive algorithms to save on the required stack space you need to calculate for each OptixPipeline.
Calculating the correct stack size yourself is mandatory when using direct or continuation callables.

1 Like

Thanks for replying.

In my case, the getPRD() is just copied from the SDK files:

static __forceinline__ __device__ DR::SamplePRD *getPRD()
{
    const unsigned int u0 = optixGetPayload_0();
    const unsigned int u1 = optixGetPayload_1();
    return reinterpret_cast<DR::SamplePRD *>(unpackPointer(u0, u1));
}

and the full function definitioin is just a simple

extern "C" __global__ void __continuation_callable__solid_color_material(DR::SamplePRD *prd, int instance_id)

the hit program is:

extern "C" __global__ void __closesthit__triangle_mesh()
{
    printf("%d %d\n", optixGetPayload_0(), optixGetPayload_1());
    DR::SamplePRD *prd = getPRD();
    if (prd->log)
    {
        printf("hit program launched\n");
    }
    // retrive trace data
    DR::HitData *sbt_record = reinterpret_cast<DR::HitData *>(optixGetSbtDataPointer());
    DR::TriangleMeshData *sbt_data =
        reinterpret_cast<DR::TriangleMeshData *>(sbt_record->mesh_payload);
    const int prim_idx = optixGetPrimitiveIndex();
    const float3 ray_dir = optixGetWorldRayDirection();
    const float3 hit_pos = optixGetWorldRayOrigin() + optixGetRayTmax() * ray_dir;
    const int vert_idx_offset = prim_idx * 3;
    const int mat_id = sbt_data->mat_id;
    const int mat_instance_id = sbt_data->mat_instance_id;
    const DR::Vertex v0 = sbt_data->vertexes[vert_idx_offset + 0];
    const DR::Vertex v1 = sbt_data->vertexes[vert_idx_offset + 1];
    const DR::Vertex v2 = sbt_data->vertexes[vert_idx_offset + 2];

    // REVIEW: how to pass hit infomation into material shader ?
    if (prd->log)
    {
        printf("hit data loaded\n");
        printf("mat_id=%d, mat_instance_id=%d\n", mat_id, mat_instance_id);
        printf("current prd: %u\n", prd);
    }
    prd->tracker_tail->hit = true;
    prd->tracker_tail->hit_pos = hit_pos;
    optixContinuationCall<void, DR::SamplePRD *, int>(mat_id, prd, mat_instance_id);
    if (prd->log)
    {
        printf("hit program ended\n");
    }
}

the continuation callable program is:

#include "SolidColorMaterial.cuh"

extern "C" __global__ void __continuation_callable__solid_color_material(DR::SamplePRD *prd,
                                                                         int instance_id)
{
    printf("mat program get prd: %u\n", prd);
    
    if (prd->log) // this is when things go wrong
    {
        printf("material program launched\n");
    }

    DR::CallableData *sbt_record = reinterpret_cast<DR::CallableData *>(optixGetSbtDataPointer());
    DR::SolidColorMaterialData *data =
        reinterpret_cast<DR::SolidColorMaterialData *>(sbt_record->callable_payload);
    data = data + instance_id;
}

the original PRD is allocated inside the sample loop of raygen program and passed to hit program like this:

for (size_t s = 0; s < params.samples_per_launch; s++)
    {
        // generate ray
        float2 subpixel_jitter = make_float2(curand_uniform(&rng), curand_uniform(&rng)) - 0.5;
        const float2 d = make_float2((static_cast<float>(launch_index.x) + subpixel_jitter.x) /
                                         static_cast<float>(params.width),
                                     (static_cast<float>(launch_index.y) + subpixel_jitter.y) /
                                         static_cast<float>(params.height));
        float3 ray_orig, ray_dir;
        ray_dir = normalize(d.x * camera_data->U + d.y * camera_data->V + camera_data->W);
        ray_orig = camera_data->orig;

        /* ---------------------------------- DEBUG --------------------------------- */
        ray_dir = make_float3(-1, 0, 0);
        /* ---------------------------------- DEBUG --------------------------------- */

        // init prd
        DR::SamplePRD prd;
        prd.log = log;
        prd.handle = params.handle;
        prd.ray_occlusion_test = false;
        prd.curr_depth = 0;
        prd.max_depth = params.max_depth;
        prd.sample_idx = s;
        prd.rdn_seed = &rng;
        // init tracker
        DR::TraceTrackerNode *tracker = initTrackerNode(params.num_fbos);
        prd.tracker_head = tracker;
        prd.tracker_tail = tracker;
        tracker->ray_orig = ray_orig;
        tracker->ray_dir = ray_dir;
        // pack prd pointer
        unsigned int u0, u1;
        packPointer(&prd, u0, u1);

        // trace
        optixTrace(params.handle, // handle
                   ray_orig, ray_dir,
                   1e-5f, // T min
                   1e10f, // T max
                   0.0f,  // T curr
                   OptixVisibilityMask(1),
                   OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT, // ray flag
                   0u,                                    // SBT offset
                   1u,                                    // SBT stride
                   0u,                                    // miss SBT index
                   u0, u1);

}

I may try to refactor those code to use DC for better performance, but at this time I just want something runable 🤣

I’m unable to reproduce this when changing all my direct callables to continuation callables in one of my OptiX 7 examples (intro_driver) from the above links.
Works without issues either way, just slower with continuation callables.

My system is Windows 10 20H2, RTX A6000, 471.41 display drivers, OptiX 7.3.0, CUDA 11.1, MSVS 2019, PTX code target SM 5.0 (Maxwell, is deprecated in CUDA 11) or SM 6.0 (Pascal) both worked.

I’m explicitly not using CUDA 11.3 because OptiX 7.3.0 is built with CUDA 11.1.

Means you could try the following steps to match that:

1.) Build your application with CUDA 11.1.

You can install as many CUDA toolkits side-by-side on a system as you want. I usually use the web-based installer and disable everything related to the display driver in there because that’s always out-of-date as soon as there are official display driver releases which support that CUDA version.
Make sure to set the CUDA_PATH to the version you want to use. That’s usually used to find the current CUDA toolkit version inside CMake scripts. If you rely on CUDA binaries to be in the PATH environment variable, make sure the order is the one you want.

2.) Update the display driver to 471.41 if you’re not running that already.

3.) If that still doesn’t work, please provide a complete and minimal reproducer project in failing state.
If you can make one of the existing OptiX 7.3.0 SDK examples fail the same way, that would be the easiest to analyze.

PS: If your optixTrace call above is not for a visibility ray but should find the closest hit, then do not set OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT. That will return the first hit during traversal, not the closest!

I managed to reproduce this error using optixPathTracer Sample in the SDK. Here’s my modifications:

In file optixPathTracer.cpp:

  1. in method void createProgramGroups( PathTracerState& state ) add program group creation
    {
        OptixProgramGroupDesc cc_prog_group_desc = {};
        cc_prog_group_desc.kind = OPTIX_PROGRAM_GROUP_KIND_CALLABLES;
        cc_prog_group_desc.callables.moduleCC = state.ptx_module;
        cc_prog_group_desc.callables.entryFunctionNameCC = "__continuation_callable__test";
        sizeof_log = sizeof(log);
        OPTIX_CHECK_LOG(optixProgramGroupCreate(
            state.context,
            &cc_prog_group_desc,
            1,  // num program groups
            &program_group_options,
            log,
            &sizeof_log,
            &state.cc_group
        ));
    }
    
  2. in method void createPipeline( PathTracerState& state ) link CC to pipeline:
    OptixProgramGroup program_groups[] =
    {
        state.raygen_prog_group,
        state.radiance_miss_group,
        state.occlusion_miss_group,
        state.radiance_hit_group,
        state.occlusion_hit_group,
        state.cc_group
    };
    
     /* -------------------------------------------- */
    
    OPTIX_CHECK( optixUtilAccumulateStackSizes( state.cc_group, &stack_sizes ) );
    uint32_t max_trace_depth = 2;
    uint32_t max_cc_depth = 1;
    uint32_t max_dc_depth = 1;
    
  3. in method void createSBT( PathTracerState& state ) describe an simple CCRecord
    typedef Record<int> CCRecord;
    
    CUdeviceptr  d_cc_records;
    const size_t cc_record_size = sizeof(CCRecord);
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_cc_records), cc_record_size));
    
    CCRecord cc_sbt[1];
    OPTIX_CHECK(optixSbtRecordPackHeader(state.cc_group, &cc_sbt[0]));
    cc_sbt[0].data = -1;
    
    CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void*>(d_cc_records),
        cc_sbt,
        cc_record_size,
        cudaMemcpyHostToDevice
    ));
    
    /* -------------------------------------------------- */
    
    state.sbt.callablesRecordBase = d_cc_records;
    state.sbt.callablesRecordStrideInBytes = static_cast<uint32_t>(cc_record_size);
    state.sbt.callablesRecordCount = 1;
    
  4. for convenience, I set the window size to 1*1 in main() function
    int main( int argc, char* argv[] )
    {
        PathTracerState state;
        state.params.width                             = 1;
        state.params.height                            = 1;
    

Then, in file optixPathTracer.cu, I add a CC function and a simple call in CH function:

/* ----------------------------------------- */
    // ORIGINAL CH FUNCTION

    printf("prd is %u\n", prd);
    optixContinuationCall<void, RadiancePRD*, int>(0, prd, 0);
}


extern "C" __global__ void __continuation_callable__test(RadiancePRD * prd, int var)
{
    printf("prd is %u\n", prd);
    printf("prd->countEmitted is %d\n", prd->countEmitted);
}

I rebuild the project in VS and run Local Windows Debugger, and the problem is there.

Console output are as follows:

[ 4][       KNOBS]: All knobs on default.

[ 4][  DISK CACHE]: Opened database: "C:\Users\MartinZHe\AppData\Local\NVIDIA\OptixCache\cache7.db"
[ 4][  DISK CACHE]:     Cache data size: "179.9 MiB"
[ 4][   DISKCACHE]: Cache hit for key: ptx-47397-keya00b244ceb763997a6d032311bb639bf-sm_86-rtc1-drv471.41
[ 4][COMPILE FEEDBACK]:
[ 4][COMPILE FEEDBACK]: Info: Pipeline has 1 module(s), 5 entry function(s), 2 trace call(s), 1 continuation callable call(s), 0 direct callable call(s), 18 basic block(s) in entry functions, 699 instruction(s) in entry functions, 0 non-entry function(s), 0 basic block(s) in non-entry functions, 0 instruction(s) in non-entry functions

prd is 2063596632
prd is 2063596632
CUDAOutputBuffer destructor caught exception: CUDA call (cudaFree( reinterpret_cast<void*>( m_device_pixels ) ) ) failed with error: 'operation not supported on global/shared address space' (C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.3.0\SDK\sutil/CUDAOutputBuffer.h:139)

Caught exception: CUDA call (cudaStreamSynchronize( m_stream ) ) failed with error: 'operation not supported on global/shared address space' (C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.3.0\SDK\sutil/CUDAOutputBuffer.h:265)

C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.3.0\SDK\build\bin\Debug\optixPathTracer.exe (process 31172) exited with code 1.

full source files:

optixPathTracer.cpp (42.3 KB)
optixPathTracer.cu (11.6 KB)

Environment:

  • CUDA 11.3 & 11.1
  • Optix 7.3
  • Windows 21H1
  • GPU Driver 471.41

I also do some experiment with DC program with the sample optixCallablePrograms and it turns out that DC calls with pointer as parameter works as expect.

In both case I pass an pointer of a float3 variable allocated in hit program and tried to print the content of that variable inside callables. The DC program works just fine and CC program failed with the same error above.

New files that I’ve changed:

optixPathTracer.cu (modify & add few lines in the hit & cc function)

optixCallablePrograms.cu (modify & add few lines in the hit & dc function)

optixCallablePrograms.cu NEW (I forgot to add the new argtype into DC call template last time, but it strangely works without any warning / error)

Result:

optixPathTracer (CC)

optixCallablePrograms (DC)


A college of mine tested the original program on Windows 10, CMake 3.21.0, CUDA 11.4, Optix 7.3.0 and VS 2019 also result in the same error.

I found that if I modify only the .cu file in optixCallablePrograms Sample and not touch the configuration, the problem wont appear. But I do managed to reproduce the problem using optixPathTracing Sample and those modifications looks normal…

Update:

After refactoring some of my code with DC program, the problem still occurs.

Maybe it is some configuration problem?

Ok, thanks a lot for the optixPathTracer reproducer.
I can see the same error with the first two sources optixPathTracer.cpp/.cu on my test configuration and will file a bug report against OptiX.

I also tried without the curand.h include, matched the debugLevel of the OptixModuleCompileOptions and OptixPipelineLinkOptions to both use OPTIX_COMPILE_DEBUG_LEVEL_LINEINFO (that’s incorrect in the original example), and fixed the warnings of the printf("prd is %u\n", prd)you added. Pointers are 64-bit and printed with %p.
No change to the failure.

Actually scrap that.

The issue is an incorrect definition of the callable:
extern "C" __global__ void __continuation_callable__test(RadiancePRD * prd, int var)

I missed that detail in your initial post’s code excerpts. __global__ is only used for the other program domains, callables are device side functions and must be declared with __device__.

Changing your code to this makes it work:
extern "C" __device__ void __continuation_callable__test(RadiancePRD * prd, int var)

1 Like

Ahh, that’s really a stupid mistake… thanks a lot for your help.

No problem.
I hadn’t seen that CUDA error in that context before so next time I know immediately.

Still, I’d recommend trying to avoid continuation callables if you do not absolutely need them.
They are definitely slower than inlined code (everything is) or direct callables.

I have another problem after changing the definition to __device__.

I got COMPILE ERROR: No functions with semantic types found when calling optixModuleCreateFromPTX on that file.

Here’s the source file:

#include "SolidColorMaterial.cuh"

extern "C" __device__ void __continuation_callable__solid_color_material(DR::SamplePRD *prd,
                                                                         int instance_id)
{
    printf("mat program get prd: %u\n", prd);
    printf("prd->log=%d", prd->log);

    if (prd->log)
    {
        printf("material program launched\n");
    }

    DR::CallableData *sbt_record = reinterpret_cast<DR::CallableData *>(optixGetSbtDataPointer());
    DR::SolidColorMaterialData *data =
        reinterpret_cast<DR::SolidColorMaterialData *>(sbt_record->callable_payload);
    data = data + instance_id;

    printf("num mat instances: %d\n", sbt_record->n_instances);
    print_float3(data->radiance);
    printf("\n");

    // prd->tracker_tail->pixel_buffers[0] = make_float4(data->radiance, 1.0);
    // if (prd->log)
    // {
    //     printf("material program ended\n");
    // }
}

/**
 * @brief generate a new tracker and append to the end
 *
 * @param prd
 * @param instance_id
 */
static __forceinline__ __device__ void rayGeneration(DR::SamplePRD *prd, int instance_id)
{
}

/**
 * @brief shade the ray using result from traced next tracker
 *
 * @param tracker
 * @param instance_id
 */
static __forceinline__ __device__ void shadeRay(DR::TraceTrackerNode *tracker, int instance_id)
{
}

extern "C" __device__ void __direct_callable__solid_color_material(DR::SamplePRD *prd,
                                                                   DR::TraceTrackerNode *tracker,
                                                                   int instance_id,
                                                                   DR::MaterialCallMode mode)
{
    printf("dc launched width payload %u, tracker %u\n", prd, tracker);
    if (prd->log)
    {
        printf("111\n");
    }

    if (mode == DR::MaterialCallMode::RAY_GENERATION)
    {
        rayGeneration(prd, instance_id);
    }
    else if (mode == DR::MaterialCallMode::SHADE)
    {
        shadeRay(tracker, instance_id);
    }
    printf("dc ended\n");
}

Changing one of the callables’ domain to __global__ will help to pass the module creation stage, but will raise an OPTIX_ERROR_INVALID_VALUE when I create program group afterwards.

It seems that the program is not recognizing functions with __device__ domain.

Full includes are as follows:

#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <curand_kernel.h>
#include <vector_functions.h>
#include <vector_types.h>

#include <optix_device.h>
#include <optix_types.h>

#include "data_structure/BackgroundData.h"
#include "data_structure/CameraData.h"
#include "data_structure/LaunchDS.h"
#include "data_structure/LightData.h"
#include "data_structure/MaterialData.h"
#include "data_structure/MeshData.h"
#include "data_structure/Records.h"

#include "helper_math.h"

Ahh…another stupid mistake.

I drop some NVRTC compile options… Thanks anyway…

Seriously, code excerpts won’t help here when they are not a complete and minimal reproducer.
(Nobody except you can compile what you posted because it’s missing all necessary headers. Stop doing that.)

That issue is most likely because you’re not using the required CUDA compiler options.

If you compile a module which only consists of device functions and not a single call to any of them, all that is eliminated as dead code with default CUDA compiler settings.
You’d need to use --keep-device-functions (not available in NVRTC) or --relocatable-device-code=true to let the CUDA compiler keep these device functions.

Have a look through these forum search results:
https://forums.developer.nvidia.com/search?q=relocatable-device-code%20category%3A167

I’m not sure if it’s enough to add a global dummy function like
extern "C" __global__ void __raygen__dummy() {}
which you then don’t use. I never needed to do that.

I can really recommend looking also into my OptiX 7 examples and the compilation framework I’m using.
Those are only using the OptiX SDK headers and nothing of the SDK examples

These are the NVCC options I’m using:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/CMakeLists.txt#L168
for all my CUDA files, including the ones with just callable programs:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/bxdf_diffuse.cu
No issues when loading these here:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/src/Device.cpp#L667

Sorry for my stupid question, thanks a lot for your help.

It’s fine. I wasn’t fast enough.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.