Updating the IAS each frame

Hi,

I am using a two-level structure: an IAS with handles to around 57k GASes (instances with transforms). Each frame, a very small number of transforms are updated. My understanding (which might be wrong) is that I simply need to update the transform on the corresponding OptixInstance and then do an OPTIX_BUILD_OPERATION_UPDATE on the top-level IAS.

This currently looks something like this:

void Context::updateAccel() {
    auto& scene = Scene::GetInstance();

    // Copy each mesh's current world transform into its host-side OptixInstance.
    for (const auto& [name, mesh] : scene.getMeshes()) {
        memcpy(geometryInstances[name]->instance.transform,
               TransformToOptixMat(mesh->worldTransform).data(),
               sizeof(float) * 12);
    }

    // Refit the top-level IAS in place.
    accelBuildOptionsIAS.operation = OPTIX_BUILD_OPERATION_UPDATE;
    OptixCheck(optixAccelBuild(optixContext, stream, &accelBuildOptionsIAS,
                               &instanceInput, 1,
                               tempBuffer.data(), tempBuffer.getSizeInBytes(),
                               rootIAS.data(), rootIAS.getSizeInBytes(),
                               &asHandle, nullptr, 0));

    CudaCheck(cudaStreamSynchronize(stream));
}

where geometryInstances[name]->instance is an OptixInstance.

I am seeing this operation take almost a full second. Is this kind of performance expected? Is there a better or faster way to do this? The actual geometry never changes in my scene; there are only positional/rotational updates (over potentially thousands of frames).

Thanks

Hey @jagj,

Your understanding sounds correct: you should only need to update the transform matrices and then do an UPDATE operation on your IAS.

When you say this operation is taking 1 second, do you mean the host-side wall-clock timing of your updateAccel() function?

If you copy all of your transforms, the total data that needs to be transferred to the GPU should be 12 floats * 4 bytes * 57,000 instances = 2,736,000 bytes (about 2.7 MB). That transfer should take only a small fraction of a second, and I expect the optixAccelBuild() call to similarly take a very small amount of time.

I don’t see in your code snippet where the data transfer to the GPU is occurring. Are you copying each matrix separately?

The first thing I would recommend is running your application through Nsight Systems. This way you will be able to see how much time goes to each portion of the operation - host functions, data transfer, and optixAccelBuild. I’m guessing that almost all of your 1 second is occupied by the scene.getMeshes() loop, and that the optixAccelBuild() is only a tiny blip at the end.

From the code snippet, there are a couple of things I would suspect are taking nearly all of your time: geometryInstances[name] is presumably calling a string hashing function or something? And TransformToOptixMat() looks like it might be doing host-side memory allocation (is it constructing a new std::vector? is it doing a matrix multiply? is it copying data?). Any of those things will take time, but if it’s allocating memory, that is likely to be the primary culprit.

So if there are 57k meshes in the scene, keep in mind that the loop body makes at least 4 separate function calls per iteration, meaning over 200k host-side function calls, and it looks like some of them are non-inline and non-trivial. I also can’t tell whether scene.getMeshes() might be expensive.

There are some alternatives to handling your scene updates this way. One of them would be to have your simulation/update code write directly into your instance transform array. Then, when it comes time to update the IAS, your updateAccel() function would consist of only one cudaMemcpy() and one optixAccelBuild().
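A minimal sketch of that approach, assuming a persistent std::vector<OptixInstance> named hostInstances that the simulation writes into, and a device buffer d_instances that instanceInput already points at (both names are placeholders, not from your code):

void Context::updateAccel() {
    // The simulation has already written the new 3x4 transforms directly
    // into hostInstances[i].transform, so there is no per-mesh loop here.

    // One bulk transfer for all 57k instances (~2.7 MB).
    CudaCheck(cudaMemcpyAsync(reinterpret_cast<void*>(d_instances),
                              hostInstances.data(),
                              hostInstances.size() * sizeof(OptixInstance),
                              cudaMemcpyHostToDevice, stream));

    // One refit of the top-level IAS.
    accelBuildOptionsIAS.operation = OPTIX_BUILD_OPERATION_UPDATE;
    OptixCheck(optixAccelBuild(optixContext, stream, &accelBuildOptionsIAS,
                               &instanceInput, 1,
                               tempBuffer.data(), tempBuffer.getSizeInBytes(),
                               rootIAS.data(), rootIAS.getSizeInBytes(),
                               &asHandle, nullptr, 0));

    CudaCheck(cudaStreamSynchronize(stream));
}

The per-frame cost is then one transfer plus one refit, with no per-mesh host work at all.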

A different alternative is to track which instances are updated and which are untouched. Then you can coalesce just the dirty transforms into blocks that will each be copied separately to the GPU, and your loop over the meshes can immediately skip the non-dirty ones. Coalescing consecutive dirty transforms into blocks adds some complexity, and whether it pays off depends on how many dirty transforms you need to handle on an average frame, and how often the dirty transforms are side by side in memory. If only a very small number of transforms are dirty in any given frame, you could simply copy each dirty transform individually, i.e., call cudaMemcpy() once per dirty transform.
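The simplest per-dirty-transform variant might look roughly like this, assuming the simulation records changed instances in a std::vector<size_t> named dirtyIndices, plus the same hypothetical hostInstances/d_instances as above (offsetof needs <cstddef>):

// Copy only the transforms that changed this frame, one small
// async copy per dirty instance, then refit as usual.
for (size_t i : dirtyIndices) {
    CUdeviceptr dst = d_instances
                    + i * sizeof(OptixInstance)
                    + offsetof(OptixInstance, transform);
    CudaCheck(cudaMemcpyAsync(reinterpret_cast<void*>(dst),
                              hostInstances[i].transform,
                              sizeof(float) * 12,
                              cudaMemcpyHostToDevice, stream));
}
dirtyIndices.clear();
// ...followed by the same UPDATE optixAccelBuild() as before.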

First order of business for you, I think, is to do some profiling and get a better picture of what the timing is for all the sub-components of your function.

–
David.

Note: I originally typed “Nsight Compute” but I meant Nsight Systems. I edited the reply, but in case you got the email version and/or already read it: my suggestion is to start with an Nsight Systems profile to get a good view of what’s happening during the 1 second. Nsight Systems will give you a visual timeline of all the function calls during that time.

–
David.

Hi @dhart ,

I apologize; I accidentally left out the GPU transfer for the instance data in my snippet. However, as you suspected, the culprit does not appear to be in this function at all. After doing some profiling, it looks like the OptiX launch for the camera is what increases, from around 0.0192613 s to 1.33767 s (give or take a millisecond). The first number is what I get if I build the accel structure once and then never update it. If I begin updating each frame, the launches suddenly spike quite dramatically.

I am profiling just the launch itself with the following:

// Wall-clock timing around the synchronized launch.
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
camera->launch(context->pipelines[sensor.first]->pipeline, launchParamsBuffer.data(), launchParamsBuffer.getSizeInBytes(), &context->pipelines[sensor.first]->sbt);
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << "camera time: " << elapsed.count() << "s\n";

where camera->launch is:

void Camera::launch(OptixPipeline pipeline, const CUdeviceptr launchParamsBuffer, size_t sizeInBytes, OptixShaderBindingTable* sbt) {
    OptixCheck(optixLaunch(pipeline, stream, launchParamsBuffer, sizeInBytes, sbt, cameraBuffers->res.x, cameraBuffers->res.y, 1));

    // Block until the launch finishes so the host-side timing is meaningful.
    CudaCheck(cudaStreamSynchronize(stream));
}

I will look into using Nsight Systems to hopefully pinpoint this further, but off the top of my head I’m not quite sure why performance would decrease so dramatically for the actual raycasts. In the scene I am rendering, there is only a single object whose transform is changing (and it actually doesn’t change until around frame 100, yet I see the performance impact before the transform values have actually changed).

My first thought based on this info is that perhaps the BVH quality is degrading due to the UPDATE operation. You can tell whether this is happening by switching to a full BUILD operation instead. The UPDATE operation itself is faster than BUILD, but once things move around a lot inside the BVH, traversal performance can suffer.
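Testing that should be a one-line change to your earlier snippet. Just note that a full build generally needs the larger tempSizeInBytes reported by optixAccelComputeMemoryUsage(), rather than the smaller tempUpdateSizeInBytes, so your temp buffer may need to grow:

// Rebuild from scratch each frame instead of refitting. If launches
// stay fast with BUILD but degrade with UPDATE, the refit is the problem.
accelBuildOptionsIAS.operation = OPTIX_BUILD_OPERATION_BUILD;
OptixCheck(optixAccelBuild(optixContext, stream, &accelBuildOptionsIAS,
                           &instanceInput, 1,
                           tempBuffer.data(), tempBuffer.getSizeInBytes(),
                           rootIAS.data(), rootIAS.getSizeInBytes(),
                           &asHandle, nullptr, 0));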

In the scene I am rendering, there is only a single object whose transform is changing (and actually doesn’t change until around frame 100 and I see the performance impact before the transform values have actually changed).

Okay, this part doesn’t make sense to me yet. Render times should stay constant if the scene isn’t changing. If it’s not some kind of measurement bug, then an Nsight Compute profile is in order. With Nsight Compute, you might be able to see why the render (“raygen”) kernel is going slower, for example whether cache misses are increasing significantly.

A few things you could try/check:

  • Use a BUILD operation instead of UPDATE.
  • Disable your instance transform updates completely. See if the launches still slow down.
  • Check to make sure the order of instances in the transform array isn’t getting mixed up over time.
  • Check whether any moving instances are very close to the camera.
  • Profile your OptiX launch in Nsight Compute, once when it’s fast, and again when it’s slow. Check to see what’s coming out different.

–
David.

  • Check to make sure the order of instances in the transform array isn’t getting mixed up over time.

This was the issue: the order was being changed (we were using maps to populate the GPU buffer and not keeping the same order between the build and update calls). Changing the operation to BUILD showed far better performance, and then fixing the order fixed the issue completely for the UPDATE operation.
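For anyone else who hits this, a rough sketch of the kind of stable ordering that avoids it (makeInstance and the container names are placeholders, not our exact code):

// At scene-build time, fix the instance order once and remember it.
std::vector<OptixInstance> hostInstances;              // order never changes
std::unordered_map<std::string, size_t> instanceIndex; // name -> slot
for (const auto& [name, mesh] : scene.getMeshes()) {
    instanceIndex[name] = hostInstances.size();
    hostInstances.push_back(makeInstance(*mesh));      // hypothetical helper
}

// Per-frame updates go through the remembered index instead of fresh
// map iteration, so UPDATE always sees the same order as the original BUILD.
memcpy(hostInstances[instanceIndex[name]].transform,
       TransformToOptixMat(mesh->worldTransform).data(),
       sizeof(float) * 12);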

Thanks for the help!
