OptiX crashing when launching pipeline with big data

Hello everyone, I am trying to perform baked ambient occlusion using OptiX on a triangular surface mesh.
To achieve this, I generate numberOfRays = numberOfSamples * numberOfVertices rays.

My hardware is an AMD Ryzen 5 3600 and a GTX 1070.
I use OptiX 7.2, NVCC 11.2, and GCC 10.2.0.

The following code launches the pipeline:

    void launchAoPipeline(AoState& aoState)
    {
        CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&aoState.d_params), sizeof(Params)));
        CUDA_CHECK(cudaMemcpy(reinterpret_cast<void*>(aoState.d_params), &aoState.params, sizeof(Params),
                              cudaMemcpyHostToDevice));

        unsigned int numberOfRays = aoState.params.mesh.numVertices * aoState.params.aoSamples;
        std::cout << "\n\n\n\n" << numberOfRays << "\n\n\n\n";
        // Launch pipeline
        OPTIX_CHECK(optixLaunch(aoState.pipeline, nullptr /*default stream*/,
                                reinterpret_cast<CUdeviceptr>(aoState.d_params), sizeof(Params), &aoState.sbt,
                                numberOfRays, 1, 1));
        CUDA_SYNC_CHECK();
    }

The following code contains my kernels:

extern "C"
{
__constant__ Params params;
}

extern "C" __global__ void __raygen__ao()
{
    // Lookup location in the launch grid
    const unsigned int vertexId = optixGetLaunchIndex().x / params.aoSamples;
    const unsigned int rayId = optixGetLaunchIndex().x % params.aoSamples;

    // The origin of the ray is the location of the current vertex
    float3& rayOrigin = params.mesh.vertices[vertexId];
    float3& normal = params.mesh.normals[vertexId];

    float3 rayDirection = params.rayDirections[rayId];
    if (dot(rayDirection, normal) < 0)
    {
        // reverse ray
        rayDirection *= -1;
    }

    // Cast ray
    optixTrace(params.gasHandle, rayOrigin, rayDirection, 0.0f, 1e16f, 0.0f, OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT, 0, 0, 0);
}

extern "C" __global__ void __closesthit__ao()
{
    const unsigned int vertexId = optixGetLaunchIndex().x / params.aoSamples;
    atomicAdd(&params.mesh.rayHits[vertexId], 1);
}

The surface that I am testing has:

Surface vertices: 17,567,820
Surface triangles: 35,148,720

With 32 samples, numberOfRays = 562,170,240 and everything runs just fine.
With 64 samples, numberOfRays = 1,124,340,480 and I get the following error:

    [ 2][       ERROR]: Error launching work to RTX
    terminate called after throwing an instance of 'sutil::Exception'
      what():  OPTIX_ERROR_LAUNCH_FAILURE: Optix call 'optixLaunch(aoState.pipeline, nullptr , reinterpret_cast<CUdeviceptr>(aoState.d_params), sizeof(Params), &aoState.sbt, numberOfRays, 1, 1)' 

Any ideas why this might be happening? Is there some kind of limit? Do I need to split the launch into batches for big data, given that limit?

Hi @spyridon97, welcome to the OptiX forum!

There is an upper limit on the OptiX launch size; currently it is 2^30 (1,073,741,824) threads.

Would it be feasible to work around this by looping over your number of samples? That way your launch size would be 17,567,820. This launch size is plenty large to fully saturate the GPU, so I think there will be no harm in adding a sample loop to your raygen program. It may even reduce divergence to do so.
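
Here is roughly what I had in mind — just a minimal sketch, not your exact Params (it assumes the launch width equals numVertices, that your pipeline compile options have numPayloadValues >= 1, and that the closest-hit program reports the hit through payload register 0 instead of calling atomicAdd):

    extern "C" __global__ void __raygen__ao()
    {
        // One launch thread per vertex
        const unsigned int vertexId = optixGetLaunchIndex().x;

        const float3 rayOrigin = params.mesh.vertices[vertexId];
        const float3 normal    = params.mesh.normals[vertexId];

        unsigned int hits = 0;
        for (unsigned int s = 0; s < params.aoSamples; ++s)
        {
            float3 rayDirection = params.rayDirections[s];
            if (dot(rayDirection, normal) < 0)
            {
                // reverse ray
                rayDirection *= -1;
            }

            unsigned int hit = 0; // payload register 0, set to 1 by __closesthit__ao
            optixTrace(params.gasHandle, rayOrigin, rayDirection, 0.0f, 1e16f, 0.0f, OptixVisibilityMask(255),
                       OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT, 0, 0, 0, hit);
            hits += hit;
        }

        // Exactly one thread writes each vertex, so no atomic is needed here.
        params.mesh.rayHits[vertexId] = hits;
    }

    extern "C" __global__ void __closesthit__ao()
    {
        optixSetPayload_0(1);
    }

The launch would then be optixLaunch(..., numVertices, 1, 1).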


David.

Thanks a lot for letting me know about the limit!

I tried putting the loop inside the kernel and saw that this way I get 1.7x worse results (for surfaces with 500k vertices).

Version 1) If I loop over the samples inside the kernel:

                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   97.91%  6.70587s         1  6.70587s  6.70587s  6.70587s  __raygen__ao_0x5382ebcb21e3059a_ss_0
                        1.95%  133.30ms        70  1.9042ms  2.3680us  41.877ms  NVIDIA internal
                        0.14%  9.5884ms        14  684.89us     704ns  4.7839ms  [CUDA memcpy HtoD]
                        0.01%  439.07us         1  439.07us  439.07us  439.07us  [CUDA memcpy DtoH]
                        0.00%  25.376us         4  6.3440us     768ns  22.688us  [CUDA memset]
                        0.00%  4.3520us         2  2.1760us  1.9840us  2.3680us  [CUDA memcpy DtoD]

Version 2) If I launch the pipeline once for every sample:

     GPU activities:   97.83%  6.73851s       128  52.645ms  38.424ms  63.892ms  __raygen__ao_0x6be7943ef735f7a2_ss_0
                        2.00%  138.07ms        70  1.9725ms  2.4000us  43.178ms  NVIDIA internal
                        0.14%  9.9005ms       268  36.941us     704ns  4.8149ms  [CUDA memcpy HtoD]
                        0.01%  803.94us         1  803.94us  803.94us  803.94us  [CUDA memcpy DtoH]
                        0.01%  535.84us       256  2.0930us  1.8560us  2.4640us  [CUDA memcpy DtoD]
                        0.00%  48.672us         4  12.168us     768ns  45.888us  [CUDA memset]

Version 3) If I use the code that I posted here:

     GPU activities:   96.37%  3.91939s         1  3.91939s  3.91939s  3.91939s  __raygen__ao_0xd3007ee3d797c47a_ss_0
                        3.37%  136.96ms        70  1.9566ms  2.3360us  44.044ms  NVIDIA internal
                        0.24%  9.7222ms        14  694.44us     704ns  4.8275ms  [CUDA memcpy HtoD]
                        0.02%  801.44us         1  801.44us  801.44us  801.44us  [CUDA memcpy DtoH]
                        0.00%  53.184us         4  13.296us     800ns  50.336us  [CUDA memset]
                        0.00%  4.7040us         2  2.3520us  2.1440us  2.5600us  [CUDA memcpy DtoD]

When you use the loop, are you still calling atomicAdd() inside the inner sample loop? If so, can it be removed or moved outside the loop? I was imagining that you would accumulate and average the samples in your raygen program before writing the result to memory, so that you wouldn’t need to use an atomic at all.

Also if your vertices have any spatial locality, then it might help to factor your numberOfRays variable into two numbers that are near the square root of numberOfRays. So you would try to find the smallest two numbers m and n such that m * n >= numberOfRays. In general there will be slightly too many threads, so you’d want to add a test at the beginning of raygen to check if the index is valid, and if not then exit immediately.
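
For example, something along these lines — just a sketch; it assumes you add a numberOfRays field to Params so that raygen can reject the padding threads:

    #include <cmath>

    // Factor a 1D launch of numberOfRays threads into a near-square 2D launch,
    // rounding up so that width * height >= numberOfRays.
    static void computeLaunchDims(unsigned long long numberOfRays, unsigned int& width, unsigned int& height)
    {
        width  = static_cast<unsigned int>(std::ceil(std::sqrt(static_cast<double>(numberOfRays))));
        height = static_cast<unsigned int>((numberOfRays + width - 1ULL) / width);
    }

    // ... then launch with: optixLaunch(pipeline, stream, d_params, sizeof(Params), &sbt, width, height, 1)

and at the beginning of the raygen program:

    // Rebuild the linear index from the 2D launch and exit immediately for padding threads.
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    const unsigned long long linearIndex = static_cast<unsigned long long>(idx.y) * dim.x + idx.x;
    if (linearIndex >= params.numberOfRays) // assumed new Params field
        return;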

The reason I think this could help you is because OptiX tiles the threads in a 2D launch into small 4x8 blocks in order to increase the coherence of all the ray directions in a warp, and it usually makes a pretty big difference for primary camera rays. In your case, this coherence would be partly in the form of ray origins that are closer together for all the threads in a warp.

If both of those suggestions are applicable, I’d recommend doing them one at a time and testing the performance difference for each one separately.


David.

For Version 1 & 2 (check my previous reply), i don’t use atomics cause as you mention, there is no need.

I managed to batch the launches using the max limit as the batch size.

Launch pipeline code

    void launchAoPipeline(AoState& aoState)
    {
        CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&aoState.d_params), sizeof(Params)));

        // The launch of the pipeline has a hard limit of width * height * depth = 1,073,741,824.
        // Therefore we need to batch.
        const unsigned int maxBatchSize = 1073741824;
        // Cast before multiplying to avoid 32-bit overflow for large sample counts
        const unsigned long long numberOfRays =
            static_cast<unsigned long long>(aoState.params.mesh.numVertices) * aoState.params.aoSamples;

        unsigned long long raysRemaining = numberOfRays;
        unsigned long long batchSize;
        for (unsigned long long i = 0; i < numberOfRays; i += maxBatchSize)
        {
            aoState.params.currentBatchStartingIndex = i;
            // Set batch size
            batchSize = raysRemaining > maxBatchSize ? maxBatchSize : raysRemaining;

            CUDA_CHECK(cudaMemcpy(reinterpret_cast<void*>(aoState.d_params), &aoState.params, sizeof(Params),
                                  cudaMemcpyHostToDevice));

            // Launch pipeline
            OPTIX_CHECK(optixLaunch(aoState.pipeline, nullptr /*default stream*/,
                                    reinterpret_cast<CUdeviceptr>(aoState.d_params), sizeof(Params), &aoState.sbt,
                                    batchSize, 1, 1));
            CUDA_SYNC_CHECK();

            // Update raysRemaining and avoid a change of sign
            raysRemaining = raysRemaining > batchSize ? raysRemaining - batchSize : 0;
        }
    }

Kernel code

extern "C"
{
__constant__ Params params;
}

extern "C" __global__ void __raygen__ao()
{
    // Lookup location in the launch grid
    const unsigned long long currentIndex = params.currentBatchStartingIndex + optixGetLaunchIndex().x;
    const unsigned long long vertexId = currentIndex / params.aoSamples;
    const unsigned long long rayId = currentIndex % params.aoSamples;

    // The origin of the ray is the location of the current vertex
    float3& rayOrigin = params.mesh.vertices[vertexId];
    float3& normal = params.mesh.normals[vertexId];

    float3 rayDirection = params.rayDirections[rayId];
    if (dot(rayDirection, normal) < 0)
    {
        // reverse ray
        rayDirection *= -1;
    }

    // Cast ray
    optixTrace(params.gasHandle, rayOrigin, rayDirection, 0.0f, 1e16f, 0.0f, OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT, 0, 0, 0);
}

extern "C" __global__ void __miss__ao()
{
    // Do nothing
}

extern "C" __global__ void __closesthit__ao()
{
    const unsigned long long currentIndex = params.currentBatchStartingIndex + optixGetLaunchIndex().x;
    const unsigned int vertexId = currentIndex / params.aoSamples;
    atomicAdd(&params.mesh.rayHits[vertexId], 1);
}

Using this version I maintain the 1.7x speed-up for the big dataset that I originally had trouble with.

I think that with my current implementation I implicitly get spatial locality between the threads of every launch.
I mean, the first 4x8 block of threads is going to execute 32 rays of the first vertex (assuming at least 32 aoSamples), right?