CUDA error code 700, illegal memory access in call to optixAccelBuild

I have a C++ program that is failing with this error in a call to optixAccelBuild, specifically the call that builds the OptiX instance acceleration structure (the second call to optixAccelBuild in this function). The function is similar to some working code, and I've tried to debug it for a few days with no luck. I've obviously done something wrong, but I have no idea what.

The entire program is too big to post, so I'm just posting the failing function.

The mesh I'm trying to process is a simple cube: 8 vertices and 12 triangular faces. I'm using an index buffer for the faces.
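For context, the host-side data is laid out roughly like this (the coordinates and winding here are just illustrative, not my exact values):

// Illustrative layout only: a unit cube as 8 tightly packed float3 vertices
// and 12 index triplets (one unsigned int triple per triangle, values 0..7).
const float cubeVertices[8][3] = {
    {0,0,0}, {1,0,0}, {1,1,0}, {0,1,0},
    {0,0,1}, {1,0,1}, {1,1,1}, {0,1,1},
};
const unsigned int cubeIndices[12][3] = {
    {0,1,2}, {0,2,3},   // z = 0 face
    {4,6,5}, {4,7,6},   // z = 1 face
    {0,4,5}, {0,5,1},   // y = 0 face
    {3,2,6}, {3,6,7},   // y = 1 face
    {0,3,7}, {0,7,4},   // x = 0 face
    {1,5,6}, {1,6,2},   // x = 1 face
};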

I’ve copied the vertex and index arrays to the GPU before this function is called.

The 'f_*' functions are simple wrappers around the corresponding CUDA and OptiX functions: they call the real function, check the return status, and abort on any error, so error checking should be complete.
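They all follow the same pattern; roughly this (simplified sketch, not my exact code; the OptiX wrappers do the same thing but check against OPTIX_SUCCESS):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified sketch of the f_* wrapper pattern: call the real API,
// check the returned status, and abort on any error.
void f_cudaStreamSynchronize(cudaStream_t stream) {
    cudaError_t rc = cudaStreamSynchronize(stream);
    if (rc != cudaSuccess) {
        fprintf(stderr, "cudaStreamSynchronize failed: %s\n", cudaGetErrorString(rc));
        abort();
    }
}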

cudaMallocTracked and cudaFreeTracked are wrappers around cudaMalloc and cudaFree that keep a list of GPU memory allocations so that the code can free any unreleased GPU memory when the thread exits.
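Conceptually they look like this (simplified sketch; the real code has more bookkeeping):

#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Sketch of the tracked allocator: same behavior as cudaMalloc/cudaFree,
// plus a per-thread list of live allocations so any unreleased GPU memory
// can be freed when the thread exits.
static thread_local std::vector<void*> trackedAllocations;

cudaError_t cudaMallocTracked(void **ptr, size_t size) {
    cudaError_t rc = cudaMalloc(ptr, size);
    if (rc == cudaSuccess)
        trackedAllocations.push_back(*ptr);
    return rc;
}

cudaError_t cudaFreeTracked(void *ptr) {
    trackedAllocations.erase(
        std::remove(trackedAllocations.begin(), trackedAllocations.end(), ptr),
        trackedAllocations.end());
    return cudaFree(ptr);
}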

The CUDA documentation says that a cudaMemcpy from pageable host memory to GPU memory may return before the DMA transfer to the GPU is complete, so I put calls to cudaStreamSynchronize before the calls to optixAccelBuild, but that didn't help.

I verified the values in the index buffer look correct (in the range 0..7) in host memory.

If I try to debug with gdb or cuda-gdb, the debugger does not stop on the failing access. If I run compute-sanitizer, it doesn't report any errors.

Any suggestions about what dumb thing I did are appreciated.

bool BGThread::buildMeshInstances(void) {
    OptixAccelBuildOptions buildOptions = { };
    buildOptions.buildFlags = OPTIX_BUILD_FLAG_PREFER_FAST_TRACE | OPTIX_BUILD_FLAG_ALLOW_RANDOM_VERTEX_ACCESS;
    buildOptions.operation = OPTIX_BUILD_OPERATION_BUILD;
    OptixGeometryFlags geometryFlags[] = { OPTIX_GEOMETRY_FLAG_NONE };
    OptixBuildInput inputTemplate = { };
    inputTemplate.type = OPTIX_BUILD_INPUT_TYPE_TRIANGLES;
    inputTemplate.triangleArray.vertexFormat = OPTIX_VERTEX_FORMAT_FLOAT3;
    inputTemplate.triangleArray.vertexStrideInBytes = 3 * sizeof(float);
    inputTemplate.triangleArray.indexFormat = OPTIX_INDICES_FORMAT_UNSIGNED_INT3;
    inputTemplate.triangleArray.indexStrideInBytes = sizeof(unsigned int) * 3;
    inputTemplate.triangleArray.flags = reinterpret_cast<const unsigned int*>(geometryFlags);
    // Only 1 SBT record per mesh for now. Additional records for shadow rays later.
    inputTemplate.triangleArray.numSbtRecords = 1;
    inputTemplate.triangleArray.transformFormat = OPTIX_TRANSFORM_FORMAT_NONE;

    //  Create mesh objects, determine GPU memory requirements, allocate GPU memory and copy mesh objects to GPU
    unsigned int meshCount = myScene->getMeshes().size();
    meshHandles = new OptixTraversableHandle[meshCount];
    meshBufferSizes = new OptixAccelBufferSizes[meshCount];
    CUdeviceptr *vertexBuffers = new CUdeviceptr[meshCount];
    int i = 0;
    size_t totalAccelBufferSize = 0;
    for (MeshObject *object : myScene->getMeshes()) {
        vertexBuffers[i] = reinterpret_cast<CUdeviceptr>(object->getGpuVertices());
        inputTemplate.triangleArray.vertexBuffers = &vertexBuffers[i];
        inputTemplate.triangleArray.numVertices = object->getVertexCount();
        inputTemplate.triangleArray.indexBuffer = reinterpret_cast<CUdeviceptr>(object->getGpuFaces());
        inputTemplate.triangleArray.numIndexTriplets = object->getFaceCount();
        //inputTemplate.triangleArray.preTransform = reinterpret_cast<CUdeviceptr>(object->getGpuTransform());
        f_optixAccelComputeMemoryUsage(context, &buildOptions, &inputTemplate, 1, &meshBufferSizes[i]);
        meshBufferSizes[i].tempSizeInBytes = ROUNDUP(meshBufferSizes[i].tempSizeInBytes, OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT);
        meshBufferSizes[i].outputSizeInBytes = ROUNDUP(meshBufferSizes[i].outputSizeInBytes, OPTIX_ACCEL_BUFFER_BYTE_ALIGNMENT);
        totalAccelBufferSize = totalAccelBufferSize + meshBufferSizes[i].tempSizeInBytes + meshBufferSizes[i].outputSizeInBytes;
        buildInputs.append(inputTemplate);
        i = i + 1;
    }
    if (accelerationBuffer != 0) {
        cudaFreeTracked(reinterpret_cast<void*>(accelerationBuffer));
    }
    cudaMallocTracked(reinterpret_cast<void**>(&accelerationBuffer), totalAccelBufferSize);
    CUdeviceptr meshOutputBuffer = accelerationBuffer;
    f_cudaStreamSynchronize(0);
    for (unsigned int i = 0; i < meshCount; i++) {
        CUdeviceptr meshTempBuffer = meshOutputBuffer + meshBufferSizes[i].outputSizeInBytes;
        f_optixAccelBuild(context, 0, &buildOptions, &buildInputs[i], 1, meshTempBuffer, meshBufferSizes[i].tempSizeInBytes, meshOutputBuffer, meshBufferSizes[i].outputSizeInBytes, &meshHandles[i],
                        nullptr, 0);
        meshOutputBuffer = meshTempBuffer + meshBufferSizes[i].tempSizeInBytes;
    }

    // Allocate instance objects, one per mesh object, allocate GPU memory and copy instance objects to GPU
    OptixInstance meshInstances[meshCount];
    for (unsigned int i = 0; i < meshCount; i++) {
        meshInstances[i] = { };
        memcpy(meshInstances[i].transform, myScene->getMeshes()[i]->getTransform()->computeFinalTransform(), sizeof(float) * 12);
        meshInstances[i].instanceId = i;
        meshInstances[i].visibilityMask = 0xFF;
        meshInstances[i].sbtOffset = i * NUM_RAY_TYPES;
        meshInstances[i].flags = OPTIX_INSTANCE_FLAG_NONE;
        meshInstances[i].traversableHandle = meshHandles[i];
    }
    CUdeviceptr gpuInstances;
    cudaMallocTracked(reinterpret_cast<void**>(&gpuInstances), sizeof(OptixInstance) * meshCount);
    f_cudaMemcpy(reinterpret_cast<void**>(&gpuInstances), &meshInstances, sizeof(OptixInstance) * meshCount, cudaMemcpyHostToDevice);

    OptixBuildInput instanceInput = { };
    instanceInput.type = OPTIX_BUILD_INPUT_TYPE_INSTANCES;
    instanceInput.instanceArray.instances = gpuInstances;
    instanceInput.instanceArray.numInstances = meshCount;

    OptixAccelBufferSizes gasBufferSizes;
    CUdeviceptr gasTempBuffer;
    f_optixAccelComputeMemoryUsage(context, &buildOptions, &instanceInput, 1, &gasBufferSizes);
    cudaMallocTracked(reinterpret_cast<void**>(&gasTempBuffer), gasBufferSizes.tempSizeInBytes);
    cudaMallocTracked(reinterpret_cast<void**>(&gasOutputBuffer), gasBufferSizes.outputSizeInBytes);
    f_cudaStreamSynchronize(0);
    f_optixAccelBuild(context, 0, &buildOptions, &instanceInput, 1, gasTempBuffer, gasBufferSizes.tempSizeInBytes, gasOutputBuffer, gasBufferSizes.outputSizeInBytes, &gasHandle, nullptr, 0);
    f_cudaStreamSynchronize(0);

    cudaFreeTracked(reinterpret_cast<void*>(gasTempBuffer));
    delete[] meshHandles;
    delete[] meshBufferSizes;
    delete[] vertexBuffers;
    return true;
}

Hi there. I don't have the definition of f_cudaMemcpy(), but I'm used to seeing plain "void*" pointers passed to cudaMemcpy rather than "void**". gpuInstances is itself already a pointer to the buffer of instances, if I'm reading the code correctly. That's one thing to check, at least. Hard to debug by inspection :) Do you have a GitHub link or anything?

You’re right. That line should be

f_cudaMemcpy(reinterpret_cast<void*>(gpuInstances), &meshInstances, sizeof(OptixInstance) * meshCount, cudaMemcpyHostToDevice);

I changed it to that and the crash is gone. I'd stared at this for quite a while and just missed it.

The signature of f_cudaMemcpy is identical to that of cudaMemcpy, where the first parameter is void *.
The function is just a wrapper that calls cudaMemcpy and checks for a failing return code.
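Roughly this (simplified sketch):

// Wrapper around cudaMemcpy: forward the call and abort on any error.
void f_cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind) {
    cudaError_t rc = cudaMemcpy(dst, src, count, kind);
    if (rc != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(rc));
        abort();
    }
}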

Thanks


Good deal!

I actually used an AI language model to home in on that - one of the big commercial ones. I pasted your question and code, along with optix_host.h and optix_types.h as API reference. It did a little bit of hallucinating and I had to prompt it a few times, but it identified the bug very quickly.