Animations in OptiX 6 (previously performed using selectors)

I am working with animations of meshes that share the same (or a similar) bounding box and preserve their topology (i.e., vertex count and half-edge structure). Each animation consists of between eight and 30 key frames. The ray tracing results are processed and viewed in real time, i.e. computation time is critical, and uploading at every step is currently out of the question. I have a low number of rays, fewer than 10000, and a high ray depth, up to 60, due to the transparency of the objects in the scene.

For this setting, using selectors was perfect in the past - almost no impact on the frame rate. Now, moving to OptiX 6 with the RTX execution mode and thus deprecated selectors, I need to find a proper replacement.

https://devtalk.nvidia.com/default/topic/1055044/optix/update-to-optix-6-0-from-5-0-1-crashes-with-canonicalstate-still-used-in-function/post/5348464/#5348464 suggests using RTrayflags, which allow behaviour changes per ray:
https://raytracing-docs.nvidia.com/optix_6_0/api_6_0/html/optix__declarations_8h.html#ab847419fd18642c5edc35b668df6f67d
To my understanding, this does not help for animations, since we need to switch off some of the geometries, not the rays.

I have more than 8 frames, so the visibility masks do not work without changing the context.

https://devtalk.nvidia.com/default/topic/1056070/optix/optix-raise-error-when-finding-child-node-of-transform/post/5354109/#5354109 proposes using optixDynamicGeometry. This is the part on which I could find the least information online. I have the impression that it is difficult to use in my context, since the change of position between the animated key frames can be almost random. Please let me know if you see potential there.

I looked into the motion blur sample and the online documentation, which was mentioned in https://devtalk.nvidia.com/default/topic/1043459/acceleration-structure-memory-consumption-/?offset=1#5293341. With linear interpolation, the key frames could potentially be reduced to four or eight. My first goal would be to change between two key frames without any (motion) blur, i.e. first showing the first and then the second key frame, potentially with one unblurred interpolated frame in between. To my understanding, the change should be possible either via the motion range (i.e. the time interval of the object movement) or via the current time that is propagated using

rtTrace(top_object, ray, TIME, prd);

in pinhole_camera.cu or in accum_camera_mblur.cu. I could only see the effect of the motion range on the blurriness of the image, but no clear change between two key frames. Could you help me take a step in the right direction?

In both posts above, changing the tree is mentioned. Following that idea, I believe one could circumvent selectors by defining a group animation_boundary that surrounds the key frames. The structure would be the following:

optix::Group animation_boundary = context->createGroup();
// Set the acceleration of the animation boundary 
// Set the geometry of the animation boundary 
top_object->addChild(animation_boundary);
...
animation_boundary->setChild(0, key_frame_i);
...

One could forward the rtObject rt_object_i of every key_frame_i from the host to the device:

context["rt_object_i"]->set(rt_object_i);

and run

rtTrace(rt_object_i, ...)

where an additional variable helps to decide which frame should be used.

I like this approach, and with my current knowledge it is the one I would implement, since no recomputation of the acceleration structures should be necessary. But I do not know how to properly forward the rtObject - I only see the way of hardcoding e.g. 30 key frames.

Is it possible to use getChild of an rtObject (similar to the group on the host side) on the device side as well?
Can I forward a vector of rtObjects? Or, to formulate it differently: how did you decide about the use case in your response to https://devtalk.nvidia.com/default/topic/1044745/optix/how-to-pass-a-buffer-of-graph-nodes-to-optix-/post/5301685/#5301685?

I am open to other ideas for approaching the animation - please let me know if you need more information!

Hi, I have a few questions. What GPU do you have? (Or if you’re writing a program others will use, what minspec GPU are you targeting?) What is your framerate requirement? And do you currently have any extra time during your frame, or is your GPU already maxed out? Do you need to support visible motion blur, or are you only looking at motion blur as a way to sort-of compress your animation keyframes?

My first few thoughts of things to try are:

If you want to pre-build all your accels, and you have room to duplicate your static geometry, then you could perhaps create a separate scene & accel for each frame, and write your own selection code that would go on the CPU side, or in your raygen program.

You could use BVH refitting rather than BVH rebuilding; it’s much faster.

Have you tried moving a single copy of the geometry and rebuilding it every frame? You could do that in CUDA so that you don’t need to re-upload the data every frame, and then either rebuild or refit your BVH.

A couple of things to be aware of:

  • Motion blur features aren’t as fast as tracing without motion blur. If you don’t need visible blurring, you shouldn’t enable motion blur.

  • Updating context variables every frame can cause problems in dynamic scenes. Use a buffer instead and put any dynamic updates there. A fix for this is being released in our very next driver release, but if you use a buffer you don’t need to wait.


David.

Thank you for your fast response!

@What GPU do you have?
I have an NVIDIA Quadro RTX 4000, driver version 26.21.14.3064, CUDA 9.2, OptiX 6.0.

@What is your framerate requirement?
My current framerate is 5 fps; with RTX mode I go up to 9-10 fps. My goal would be 20 fps (I have other improvements ongoing).

@And do you currently have any extra time during your frame, or is your GPU already maxed out?
That is difficult for me to answer. I will double-check with colleagues on how I could look into that.

@Do you need to support visible motion blur, or are you only looking at motion blur as a way to sort-of compress your animation keyframes?
The second. I do my own blurring effects afterwards, so I prefer images that are as crisp as possible. Given your remark that “Motion blur features aren’t as fast as tracing without motion blur”, I will put implementing the animation with motion blur aside.

@create a separate scene & accel for each frame
My scene is pretty big (500k triangles and 180 objects), and start-up time is crucial as well. How can I prevent a duplication of the structure from having a negative impact on the start-up or run time?

Based on your thought, I could actually change the rtObject every frame and use the idea mentioned in my first post (where I talk about “circumventing selectors”) without having to upload a vector of rtObjects. Is this what you meant by moving a “single copy of the geometry”?

I will keep the point related to uploading the context variables in mind.

I will keep you posted about my evaluation considering the extra time during my frames and about the further development.

My scene is pretty big (500k triangles and 180 objects), and start-up time is crucial as well. How can I prevent a duplication of the structure from having a negative impact on the start-up or run time?

I was assuming you were prioritizing trace time over startup time. If startup time is critical, then creating and uploading a copy of the scene for each frame is a bad idea.

Based on your thought, I could actually change the rtObject every frame and use the idea mentioned in my first post (where I talk about “circumventing selectors”) without having to upload a vector of rtObjects. Is this what you meant by moving a “single copy of the geometry”?

My suggestion here, and I think probably the recommended approach, would be: upload 1 copy of your scene, upload your animation data to the GPU, and use either a CUDA kernel or a separate OptiX raygen program to apply your animation to your dynamic geometry, and then either rebuild or refit all your moving acceleration structures each frame before rendering. I recommend trying rebuild first, and if that’s fast enough, you don’t need to worry about refit. There is a tradeoff between rebuild and refit, so you can test if refit is faster, if you need to. The tradeoff is that refit might be slower to render when your moving geometry moves very far.
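To make the rebuild-vs-refit tradeoff concrete, here is a purely illustrative host-side heuristic (not an OptiX API; the 5% threshold and the displacement metric are assumptions you would tune): refit while the geometry has moved only a little since the last full build, and rebuild once it has drifted far.

```cpp
#include <cmath>
#include <vector>

struct Float3 { float x, y, z; };

// Largest per-vertex movement between the last-built pose and the
// current pose.
float maxDisplacement(const std::vector<Float3>& prev,
                      const std::vector<Float3>& curr)
{
    float maxD2 = 0.0f;
    for (size_t i = 0; i < prev.size(); ++i) {
        float dx = curr[i].x - prev[i].x;
        float dy = curr[i].y - prev[i].y;
        float dz = curr[i].z - prev[i].z;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 > maxD2) maxD2 = d2;
    }
    return std::sqrt(maxD2);
}

// Refit while the largest movement stays below an assumed 5% of the
// scene extent; rebuild (return false) once geometry moved further,
// since a refitted BVH over far-moved geometry traces slower.
bool shouldRefit(const std::vector<Float3>& prev,
                 const std::vector<Float3>& curr,
                 float sceneExtent)
{
    return maxDisplacement(prev, curr) < 0.05f * sceneExtent;
}
```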

IIRC, a rebuild of a single 500k triangle mesh on a Quadro RTX is expected to take somewhere in the neighborhood of 5ms. A refit is typically around 10x faster, so perhaps more like 0.5ms for a single 500k triangle mesh. For many small meshes, it can take longer since there is some per-mesh overhead, but smaller meshes also rebuild & refit faster too. You mentioned you have ~200 objects in your scene, what percent of objects & triangles are moving?

To me it sounds like your animation and BVH rebuild time might be a small fraction of your render time, so there might not be strong reasons to avoid re-evaluating animation and accels every frame, there might not be any need to replace selectors.

I have a low number of rays, less than 10000, and a high ray depth, up to 60, due to the transparency of the objects in the scene.

FWIW, this sounds to me like the lowest-hanging fruit in terms of increasing your frame rate. The depth in particular is going to lead to multiple challenges. First, the rays traced are dependent so it takes a lot longer even though your total ray counts are quite low. Second, you are likely to have very imbalanced threads with some tracing only a few rays and some tracing 60 rays. Keep in mind that you’re always paying the cost of the longest running thread in each warp, which is each group of 32 threads. So if 31/32 threads terminate at a depth of 1, but only 1 in every 32 threads goes to depth 60, your render time would be almost 60 times slower than it needs to be.
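The warp-divergence point above can be captured in a toy cost model: a warp of 32 threads runs as long as its deepest path, so a single depth-60 thread pins the whole warp at cost 60 even if the other 31 terminate at depth 1.

```cpp
#include <algorithm>
#include <vector>

// Toy cost model, for illustration only: the execution time of a warp
// is dominated by the deepest ray path among its 32 threads.
int warpCost(const std::vector<int>& depths) // per-thread ray depths
{
    return *std::max_element(depths.begin(), depths.end());
}
```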

I don’t know enough to be able to say how to reorganize your renderer, but I recommend contemplating ways to keep the number of rays traced per thread to a fixed small number. This might, just for example, include tracing up to a max depth of, say, 4, and then collecting all the remaining active threads and doing another launch to pick up depths 4-8, and repeat. I’m vaguely suggesting some kind of wavefront approach. I hope that helps!
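A minimal CPU-side sketch of that wavefront idea (the Path struct and termination rule are made up for illustration): paths are processed in waves of at most a fixed depth, and the still-active paths are compacted between launches so later waves run on fewer, denser threads.

```cpp
#include <vector>

// Hypothetical path state: how many more bounces this path needs.
struct Path { int remainingDepth; };

// Process all paths in waves of at most `waveDepth` bounces each,
// compacting the surviving paths between waves. Returns the number of
// launches needed to finish every path.
int countWaves(std::vector<Path> active, int waveDepth)
{
    int waves = 0;
    while (!active.empty()) {
        ++waves;
        std::vector<Path> next;
        for (Path& p : active) {
            p.remainingDepth -= waveDepth; // advance up to waveDepth bounces
            if (p.remainingDepth > 0)
                next.push_back(p);         // compaction: keep live paths only
        }
        active.swap(next);
    }
    return waves;
}
```

With a wave depth of 4, a depth-60 path takes 15 launches, but every launch runs only threads that still have work, instead of 31 idle threads waiting on one deep path.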


David.

Excuse my late reply.

To come back to the question in your first post regarding how much of the GPU is used: beyond just checking the GPU usage with the task manager (very low usage!), I checked with Nsight. But I cannot see any details regarding the usage of the GPU, probably because profiling support has not yet been released, as mentioned in https://devtalk.nvidia.com/default/topic/1047546/optix/optix-profiling-using-nsight/post/5357599/#5357599. What tool would you suggest I use? I searched the forum posts for Nsight but could not find anything describing this point in detail.

Thank you for confirming that the duplication might not be the best choice - I will put that aside for now.

upload 1 copy of your scene, upload your animation data to the GPU, and use either a CUDA kernel or a separate OptiX raygen program to apply your animation to your dynamic geometry

Are you talking here about OptiX dynamic geometries?

I will evaluate your point regarding refitting the BVH.

there might not be any need to replace selectors
As far as I understand, either way I have no choice if I want to benefit from the full speed-up of the RTX card (e.g. coming from GeometryTriangles), no?

You mentioned you have ~200 objects in your scene, what percent of objects & triangles are moving?
Currently 10000 triangles and 3 objects are exchanged at every frame change. This could however increase in the future to 100000-200000 triangles (no idea yet about the number of objects), since I would like to move a big part of the scene (but not necessarily change the position of each triangle, so it could work with the transform of the node) - but let's keep that aside for now; I will write a post as soon as it becomes relevant.

Since my last post I implemented a prototype that simply uploads the animated structures at every change of the animation frame. On smaller scenes this already does its job; I am still working on the bigger scenes (with the 10000 triangles) and could not yet identify why there are some issues - probably a minor error.

lowest-hanging fruit … the depth … wavefront approach
Very good point - we thought of something similar: stopping at a certain ray depth and then increasing the parallelism of the next frame by launching new rays both at the camera origin and at the depth where we stopped. At the end, the results of several frames are stitched together across the transparent objects. Do you have any good references for work on something similar? In our application the camera can move, so the global path consists of local paths shot at different camera positions. This would result in a delayed update of the image, which in our case is probably not so important, depending on how large the delay ends up being.

As a first step I always use the following nvidia-smi command in its own command prompt.
Copy that line into a *.bat file and just double-click it.
The last number is the number of milliseconds between reports, which means this example prints the statistics for all visible CUDA devices once per second.

"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" --format=csv,noheader --query-gpu=timestamp,name,pstate,memory.total,memory.used,utilization.memory,utilization.gpu --loop-ms=1000

@Detlef Roettger: Thank you for the proposition of the tool.
@David Hart: I currently have an issue with a bug when using optix 6 and thus can not yet check the performance.

I implemented a version using optix 5.1, that does the following:

At initialization, all the frames (i.e. vertices, normals and the indices of the points of all triangles) of the animations are loaded into OptiX buffers. The first frame of an animation is introduced as a child - let's call it c - of the current scene; all others are not introduced as children.

At every animation frame f, I replace the geometric representation of the child c:

// Get the geometry of the child
optix::Geometry geometry = c->geometry();
// Write the current vertices into the child's geometry
geometry["vertex_buffer"]->setBuffer(f->positionsBuffer().optixBuffer());
// Write the current normals into the child's geometry
geometry["normal_buffer"]->setBuffer(f->normalsBuffer().optixBuffer());
// Write the current indices into the child's geometry, even if they should be the same in our case
geometry["index_buffer"]->setBuffer(f->indexBuffer().optixBuffer());
// Mark the child's acceleration structure as dirty
c->accelerationStructure()->markDirty();

What do you think about this approach?

There is no slow-down compared to selectors. Is it correct that this approach does not upload the vertices, normals and indices, since they are already in an OptiX buffer? If not, how can I move towards uploading only once?

I currently still have some artefacts, I believe due to an incorrect update of the acceleration structures. Or is there something else missing in the example above?

In order to evaluate/understand the acceleration structures, I would like to visualize them - if possible. I will go through the forum and the other information and might post another topic if I cannot find the information.

Thank you for your fast responses, this was a very good first experience with the forum, I will be back! ;)

Just out of curiosity:
http://on-demand.gputechconf.com/gtc/2018/video/S8518/ @ 16:14:
could sharing a single acceleration structure prevent the rebuild?

Or, to formulate it better: can I use one acceleration structure for all of them - one that contains all the structures and does not have to be updated?

Hi, I’m not sure I have all the details of your situation straight, I’m not sure I understood everything you said correctly, so forgive me if I’m repeating things you know…

  • Make sure you’re marking the top level acceleration structure dirty too.

  • You can only share acceleration structures for nodes that contain exactly the same geometry in exactly the same pose. You can’t share accels between two nodes if the geometry is different in any way aside from the node’s transform. Sharing doesn’t affect whether you can build or rebuild the accel.

  • Because you’re doing animation & dynamic updates, I would suggest finding a way to not update the buffer pointers using OptiX variables; that is going to trigger larger uploads to the GPU every frame than you want. It will be better if you can upload a single static buffer with all the animation data, and index it using a frame number and/or use CUDA interop to evaluate your animation on the GPU side. To be clear, what I’m talking about is avoiding the use of geometry[“vertex_buffer”]->… If you can use the animation frame as an index into a larger static buffer of animation data, it’ll be easier and faster, at the expense of complicating your code with the indexing arithmetic. Any variables you want to upload per frame are best put into a buffer themselves. In OptiX 6, once there are more than a few variables updated, your whole scene along with all your transforms will be scheduled for copy, even if only a small portion of it changed. This isn’t a problem for single-frame static renders, but it is a problem for interactive and dynamic apps. By arranging the per-frame update data into your own buffers, you’ll have greater control over what gets copied to the GPU every frame. There are short term and long term updates to improve this situation coming soon. More information in this thread: https://devtalk.nvidia.com/default/topic/1055077/optix/optix-bad-performance-with-dynamic-objects/


David.

This resolved the artefacts - you were pretty well aware of my setting, thank you!

Nice to have these details: So it does not apply to my case and I will put it aside.

I have to say I struggled a little to put the thoughts together below, and I have reached the end of the limited time box set in my planning for this part. Thanks to your help (!!!), I have a working implementation with no visible slowdown compared to the previous one, and I will go with that for now.

However, as I mentioned in a previous post in this topic, I would like to animate objects with more triangles and I might implement your buffer solution in the future. In order to help others with similar challenges, I will state my thoughts below, excited to see your comments/corrections for a future implementation!

I try to avoid having separate buffers for vertices, normals and indices by using optix::Groups.

For the declaration of the animations_buffer, on the host:

// optix::Context ctx;

optix::Buffer animations_buffer;
animations_buffer = ctx->createBuffer(RT_BUFFER_INPUT);
animations_buffer->setFormat(RT_FORMAT_USER);
animations_buffer->setElementSize(sizeof(rtObject));
// the first entry is the geometry declared as child (see below)
animations_buffer->setSize(n_animations, max_number_frames + 1);

std::string name = "animations_buffer";
optix::Variable animations_variable = ctx[name.c_str()];
animations_variable->set(animations_buffer);

And on the device:

rtBuffer<rtObject, 2> animations_buffer;

At init fill the buffer:

// Fill a vector with the optix::Group handles on the CPU
std::vector<std::vector<optix::Group>> animations;
for each animation
{
    std::vector<optix::Group> animation_frames;
    // The first frame introduces a child
    int child_id = top_object->addChild(frame0->transformNode());
    // The group which introduced the child is added twice in order not to be overwritten by the other groups
    animation_frames.push_back(top_object->getChild<optix::Group>(child_id));
    for each frame including the first one that has been used for the child creation
    {
        optix::Group frame_group = ctx->createGroup();
        optix::Acceleration frame_accel = ctx->createAcceleration("Trbvh");
        frame_group->setAcceleration(frame_accel);
        animation_frames.push_back(frame_group);
    }
    animations.push_back(animation_frames);
}
// Write the animation vector into the buffer on the GPU.
// Note: a vector of vectors is not contiguous in memory, so each
// animation row has to be copied separately (or flattened first).
void* buffer_data = animations_buffer->map();
for (size_t i = 0; i < animations.size(); ++i)
    memcpy(static_cast<char*>(buffer_data) + i * (max_number_frames + 1) * sizeof(rtObject),
           animations[i].data(), animations[i].size() * sizeof(optix::Group));
animations_buffer->unmap();

Updating the frames means replacing the first entry in the animations: how/where do I do that, in order to prevent re-uploads of any kind and to make sure that the acceleration structures are correct?

Thank you for your patience until now!

Hi, when you say “CUDA interop to evaluate your animation on the GPU side”, do you mean that you would have a CUDA kernel (or a dedicated ray-gen OptiX program) that is simply executed every animation frame, right before the actual ray-tracing kernel is launched? In this dedicated kernel, the animated data for a particular frame (e.g., vertex positions) would be copied from a single large static buffer - which resides on the device and contains the entire animation data for all frames - into the vertex buffer that is used in an intersection program. From my understanding, this avoids an upload of data from the host to the device, by performing the copy from one memory chunk on the device to another that is also on the device.

Which brings me to my second question regarding copying data over in the device code:

geometry["vertex_buffer"]->setBuffer(buffer);

I assumed that a “Buffer” object contains a pointer to data that is already supposed to be on the device, so my impression was that swapping pointers this way shouldn’t cause any data transfer between host and device. Is that wrong? I am using OptiX 6.5 and CUDA 10.1.243. Thank you very much for your extensive clarifications above.

Yes, exactly right. Put all the animation source data on the GPU, use a CUDA kernel to evaluate the animation for a given frame, and store the results into a GPU buffer that can be used as input to the OptiX BVH build and trace functions.

OptiX versions up to and including 6.5 have several different kinds of “Buffer” objects. The most common kind of buffer used for dynamic, multi-frame scenes is the input-output buffer. There are also input-only buffers, output-only buffers, and device-only buffers. Most of the buffers (input, output, input-output) allocate both host-side storage and device side storage. The data copies are triggered by the OptiX map() and unmap() function calls. Whether data is copied depends on what kind of buffer you asked for. With an input-output buffer, map() will copy the buffer’s contents from the GPU to the host, and unmap() will copy from the host to the GPU. An input buffer will only copy from the host to the GPU, and similarly an output buffer will only copy from the GPU to the host. And the copying, by the way, is asynchronous; it doesn’t happen during the map/unmap calls, it happens before and after launch, so attempts to measure it might make it look like rendering (launch) is slow when it’s really spending time copying data over the PCI bus.


David.