Running out of memory on host well before device

I am running into memory issues and was hoping to get some feedback from others on whether what I’m seeing is expected behavior.

First, I’ll give some background on the systems I am working with. The main system I work on has a GTX Titan GPU (6GB memory) and 64GB RAM on the host. The other system I have tried has a Tesla K80 (12GB memory) and 32GB RAM on the host. In both cases I’m running OptiX 3.9 with CUDA 7.0.

I have been running some scaling tests, querying both host and device memory to get an idea of when and where I run out of memory for very large scenes. In both cases, the OS plus building all the geometry in my API (pre-OptiX) uses roughly 3-4GB of RAM on the host.

On the K80/32GB system, I run out of memory on the host at about 4 million primitives, yet at that point I have only used about 3GB of memory on the device (still about 8-9GB free). I can watch the host memory fill up as my OptiX program progresses. After adding the geometry in OptiX and compiling the context, I find that I’ve used about 20GB of RAM on the host. Then I do a dummy launch with a single ray to build the acceleration structures, and that is when it runs out of memory on the host. On this particular system it errors out at that point because it is not allowed to go into swap space.

On the TITAN/64GB system, it instead keeps going, with a considerable performance hit as it moves into swap space, so I can keep running bigger and bigger problems until the device finally runs out of memory at about 6 million primitives (for that case it uses about 40GB of host memory before even getting to the acceleration structure build!). So I would conclude that more than half of the host memory the program uses is consumed between the time the OptiX context is initialized and when it is compiled, and a little less than half during the ‘dummy’ launch. A back-of-envelope calculation of the memory I should be using for buffers and variables comes to only about 500MB, so it doesn’t seem to be coming from there.
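For reference, the memory queries themselves are nothing fancy. Here is a minimal sketch of the sort of checkpoint logging I am doing, assuming Linux (resident set size read from /proc/self/status) and the CUDA runtime for the device side:

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Print resident host memory (Linux-specific) and used device memory. */
static void printMemoryUsage( const char* label ){
  /* host: VmRSS line from the kernel's per-process status file */
  FILE* f = fopen( "/proc/self/status", "r" );
  char line[256];
  while( f && fgets( line, sizeof(line), f ) ){
    if( strncmp( line, "VmRSS:", 6 ) == 0 ){
      printf( "[%s] host %s", label, line );
      break;
    }
  }
  if( f ) fclose( f );

  /* device: free/total on the current CUDA device */
  size_t free_b = 0, total_b = 0;
  cudaMemGetInfo( &free_b, &total_b );
  printf( "[%s] device used: %zu MB of %zu MB\n", label,
          (total_b - free_b) >> 20, total_b >> 20 );
}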

So my question is: does this all sound like expected behavior, or might there be a memory problem somewhere? It seems crazy to me that OptiX would use so much memory on the host, especially before the acceleration structures have even been built. It also seems that memory usage grows exponentially as the number of primitives increases.

Thank you in advance for any comments.

This does not sound like expected behavior. This part in particular:

“After adding the geometry in OptiX and compiling the context, I find that I’ve used about 20GB of RAM on the host.”

So you loaded the geometry into your API and used about 4 GB, then you built an OptiX scene and it went to 20 GB? Am I interpreting that correctly? We need to figure out where that 20-4=16 GB is going. You might be able to track it back to buffers, or something else, by experimentation.

One thing to be aware of: for the Trbvh builder, OptiX will try to build on the device first. However, if it runs out of memory on the device, e.g., if your input geometry already exceeds device memory, then it will silently fall back to a host build and use host memory. I suspect the out-of-memory condition during the build is caused by having too much geometry going into the build, though.

Correct, but the 3-4GB of host memory used pre-OptiX is the combined total for everything on my system, including the OS. Then, after building the geometry in OptiX and compiling the OptiX context (essentially everything between and including context initialization and compilation), host memory usage increases by ~20GB, or even up to 40GB in the largest case I’ve run, which is about 6 million primitives.

One other point I should clarify is the order in which things happen (for very large scenes): first the host memory gets completely filled, before the GPU runs out of memory. The host memory then spills over to swap space on my SSD (if allowed to). Then the device memory fills up and spills over to host memory, which is already full, pushing more host memory into swap. By the time of my ray launch, both host and device memory are at 100% utilization, and things obviously run very slowly.

Anyway, I guess the most important point was your original statement that this does not seem like normal behavior. I can go through and start cutting things out to see where all the memory is coming from; I just wanted to check with others first that something indeed appears to be wrong before investing the time in tracking it down. It seemed strange to me, but I didn’t know enough about what goes on behind the scenes in OptiX to be sure.

Ok, so more than 16 GB of unexplained host mem, possibly.

I would start by printing mem usage after each buffer map/unmap call, since that’s the point at which OptiX creates a host allocation for the buffer.
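Something along these lines (a quick sketch; printMemoryUsage() stands in for whatever host-memory query you already have, and it assumes optix.h and your RT_CHECK_ERROR macro):

/* Wrap the map call so each buffer's host allocation shows up in the log. */
void* mapBufferLogged( RTbuffer buffer, const char* name ){
  void* ptr = 0;
  RT_CHECK_ERROR( rtBufferMap( buffer, &ptr ) );  /* host backing is allocated at map time */
  printMemoryUsage( name );                       /* compare against the previous checkpoint */
  return ptr;
}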

I’d also like to distinguish OptiX mem usage before the compile (which is mostly host backing for buffers and scene objects) vs. during/after the compile. If you’re loading just a massive amount of PTX programs during the compile, you might see a memory spike as the PTX gets JIT’ed and turned into lower level assembly. I really doubt that’s what is going on, but still worth separating imo. I’m still expecting that you see unexplained memory usage before the compile.

I’ve done some poking around, and narrowed down the high host memory allocation to my function that adds geometry instances. Please consider the function below:

RTgeometryinstance Model::addGeometryInstance( RTgeometry geometry, RTmaterial material, uint primitive_type, float m[16], uint UUID, RTgroup &top_level_group ){

  RTgeometrygroup geometrygroup;
  RTgeometryinstance instance;
  RTacceleration acceleration;
  RTtransform transform;

  /* Create this geometry instance */
  RT_CHECK_ERROR( rtGeometryInstanceCreate( OptiX_Context, &instance ) );
  RT_CHECK_ERROR( rtGeometryInstanceSetGeometry( instance, geometry ) );
  RT_CHECK_ERROR( rtGeometryInstanceSetMaterialCount( instance, 1 ) );
  RT_CHECK_ERROR( rtGeometryInstanceSetMaterial( instance, 0, material ) );

  /* create group to hold instance transform */
  RT_CHECK_ERROR( rtGeometryGroupCreate( OptiX_Context, &geometrygroup ) );
  RT_CHECK_ERROR( rtGeometryGroupSetChildCount( geometrygroup, 1 ) );
  RT_CHECK_ERROR( rtGeometryGroupSetChild( geometrygroup, 0, instance ) );

  /* create acceleration object for group and specify some build hints */
  RT_CHECK_ERROR( rtAccelerationCreate( OptiX_Context, &acceleration ) );
  RT_CHECK_ERROR( rtAccelerationSetBuilder( acceleration, "NoAccel" ) );
  RT_CHECK_ERROR( rtAccelerationSetTraverser( acceleration, "NoAccel" ) );
  RT_CHECK_ERROR( rtGeometryGroupSetAcceleration( geometrygroup, acceleration ) );
  
  /* add a transform node */
  RT_CHECK_ERROR( rtTransformCreate( OptiX_Context, &transform ) );
  RT_CHECK_ERROR( rtTransformSetChild( transform, geometrygroup ) );

  /* set transformation matrix */
  RT_CHECK_ERROR( rtTransformSetMatrix( transform, 0, m, 0 ) );

  /* attach the transform under the top-level group at child slot UUID
     (the caller is assumed to have already set a sufficient child count) */
  RT_CHECK_ERROR( rtGroupSetChild( top_level_group, UUID, transform ) );

  return instance;

}

Some background may be necessary: this function is called N times in a loop, where N is the number of elements (say 2 million for discussion). Running that loop, which only calls this function, adds about 11GB of host memory for 2 million elements. If I comment out lines 9-32 (everything from rtGeometryInstanceCreate through the rtGroupSetChild call), host memory usage drops to about 1GB. I have also tried commenting out other portions of the routine, and they do reduce memory usage as well, but the majority seems to come from lines 15-17 (the rtGeometryGroupCreate/SetChildCount/SetChild calls).

So is this not the proper method for adding instances? Everything works in the end and the program gives the correct answer, but maybe this is not the right way?

This pattern looks ok to me, although 2M instances is on the high side.

If I want to repro this locally, what are the primitives? Would something like a sphere be close enough on my end? I doubt they’re triangle meshes if you’re using NoAccel for each.

Also, since you’re using OptiX 3.9.1 (?) right now, could you download OptiX 4.0.2 as a quick test to see if the behavior is the same? I want to make sure this is not a bug that’s been fixed already.

And just to confirm, the memory is growing (linearly?) during the loop, right? Even before you compile?

It is a combination of triangles and rectangles, so a sphere would be comparable. It is not a triangular mesh exactly, but rather scattered triangles.

I tried OptiX 4.0, and it is a little worse. It uses roughly 2GB more host memory, so that loop now costs over 13GB of host memory.

Correct, the growth is roughly linear all the way through the loop. I should also note that I print the device memory during the loop as well, and it shows <1MB of increase.

Good to know that it exists in OptiX 4.0+, thanks.

This question is tangential to the memory issue, but what was the original motivation to store triangles/rectangles as separate instances rather than putting them all in a single geometry group with flattened transforms? Do all the triangles change every frame, such that you’re worried about dynamic rebuilds (which are still pretty fast with Trbvh)? Traversal would be noticeably faster if they were in a single accel.

My understanding was that you could only associate one transformation with each geometry group. However, it sounds like there is a way around that? I want a separate transformation for each element, which saves me from having to pass vertex and other data to the device. If it is possible, can you point me to an example that shows how to add a single instance of a single group that contains many elements with different transforms?

Right, you cannot associate multiple transform nodes with different subsets of a geometry group; the transform applies to the entire group.

What I’m asking really is why you don’t apply the transforms to the triangles/rectangles yourself ahead of time on the host, then pack the transformed triangles into a buffer in a single GeometryGroup – no more transform nodes in OptiX. This would be faster to trace and use less memory. The downside is longer BVH rebuilds/refits if the triangles are animated per frame.

The dynamicGeometry sample shows this tradeoff between rebuild time and trace time.
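Here is a rough sketch of the packing step for the triangle case. The names (makeFlattenedTriangleBuffer, base_verts, matrices) are illustrative, and the matrices are the same row-major 4x4s you currently pass to rtTransformSetMatrix; the resulting buffer would feed a single GeometryGroup with no Transform nodes:

#include <optix.h>

/* Apply each element's transform on the host and pack the transformed
   triangle vertices into one float3 buffer. */
RTbuffer makeFlattenedTriangleBuffer( RTcontext context,
                                      const float* base_verts,  /* 3 vertices of the canonical triangle */
                                      const float* matrices,    /* row-major 4x4 per element */
                                      size_t triangle_count ){
  RTbuffer vertex_buffer;
  RT_CHECK_ERROR( rtBufferCreate( context, RT_BUFFER_INPUT, &vertex_buffer ) );
  RT_CHECK_ERROR( rtBufferSetFormat( vertex_buffer, RT_FORMAT_FLOAT3 ) );
  RT_CHECK_ERROR( rtBufferSetSize1D( vertex_buffer, 3 * triangle_count ) );

  float* out;
  RT_CHECK_ERROR( rtBufferMap( vertex_buffer, (void**)&out ) );
  for( size_t t = 0; t < triangle_count; ++t ){
    const float* m = matrices + 16 * t;
    for( int v = 0; v < 3; ++v ){
      const float* p = base_verts + 3 * v;
      for( int r = 0; r < 3; ++r )  /* affine 3x4 part of the matrix */
        *out++ = m[4*r+0]*p[0] + m[4*r+1]*p[1] + m[4*r+2]*p[2] + m[4*r+3];
    }
  }
  RT_CHECK_ERROR( rtBufferUnmap( vertex_buffer ) );
  return vertex_buffer;
}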

Hmm, I suppose I could do that. It would be a little cumbersome in the code because I actually have five possible primitive element types, which all require different data to define them (e.g., a triangle is three (x,y,z) vertices, whereas a disk is an (x,y,z) center, major/minor radii, and a spherical rotation). So I would need separate buffers for each primitive type, plus another buffer that maps each primitive’s global index to its index in the geometry buffer for that primitive type. I assumed that the point of defining the transform matrix was to be able to define a generic triangle/rectangle/etc. on the device, which would be transformed automatically during the ray trace. It is nice not having to make all those different buffers and pass things around, but it does take more memory, as you said (9 floats for a triangle or 12 floats for a rectangle vs. 16 floats for a transform matrix).

There may also be a problem with BVH rebuilds for me. Although I usually won’t be rebuilding every frame, there will be rebuilds, so it would be worthwhile for me to explore the tradeoffs.

Anyway, I suspect the problem may be that I’m using OptiX in a way that was not originally intended, and that there is a substantial memory overhead associated with defining groups/instances. Since a typical application would probably only ever define on the order of a hundred instances, this overhead would normally go unnoticed. For millions of them, however, it seems to be quite substantial. Would you agree?

Transforms are more typically used to instance larger amounts of geometry, e.g., you make a forest by taking a couple of unique trees (the geometry groups) and putting hundreds of transforms over them. It’s not typical to put each triangle under a transform, but I still want to understand why this is using so much memory.

If you want to switch to the more standard setup, perhaps you could make one geometry group for each of your 5 primitive types, just to make it easier to keep the buffers and intersection/bounds programs straight. Each geometry group has its own primitive count and local primitive ids. If you rely on a global id, you might need a reverse lookup table (buffer) for local->global ids.
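As a sketch for one of the five types (all names illustrative; the intersection/bounds programs are assumed to be loaded already, and the same pattern repeats for the other four types):

/* Geometry for one primitive type, with its own programs and a
   local->global id buffer for device-side lookups. */
RTgeometry makeTriangleGeometry( RTcontext context, unsigned int triangle_count,
                                 RTprogram intersect_prog, RTprogram bounds_prog,
                                 RTbuffer* id_buffer_out ){
  RTgeometry geometry;
  RT_CHECK_ERROR( rtGeometryCreate( context, &geometry ) );
  RT_CHECK_ERROR( rtGeometrySetPrimitiveCount( geometry, triangle_count ) );
  RT_CHECK_ERROR( rtGeometrySetIntersectionProgram( geometry, intersect_prog ) );
  RT_CHECK_ERROR( rtGeometrySetBoundingBoxProgram( geometry, bounds_prog ) );

  /* local primitive id -> global UUID */
  RT_CHECK_ERROR( rtBufferCreate( context, RT_BUFFER_INPUT, id_buffer_out ) );
  RT_CHECK_ERROR( rtBufferSetFormat( *id_buffer_out, RT_FORMAT_UNSIGNED_INT ) );
  RT_CHECK_ERROR( rtBufferSetSize1D( *id_buffer_out, triangle_count ) );
  return geometry;
}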

For rebuilds, the “Trbvh” builder type is by far the fastest since it uses the device. I would expect something in the 0.5-1 second range for 2M custom primitives (faster for pure triangles) on your hardware above. If some of your geometry is static, it could go into its own group and save on rebuild time.
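In API terms that is just a builder change, plus marking the dynamic accel dirty whenever its geometry changes (a sketch; dynamic_accel stands for the RTacceleration attached to the group that changes):

/* Trbvh on the group that gets rebuilt; a separate static group would
   keep its own accel out of per-frame rebuilds. */
void setupDynamicAccel( RTacceleration dynamic_accel ){
  RT_CHECK_ERROR( rtAccelerationSetBuilder( dynamic_accel, "Trbvh" ) );
  RT_CHECK_ERROR( rtAccelerationSetTraverser( dynamic_accel, "Bvh" ) );
}

/* after the dynamic geometry changes, only this accel needs a rebuild */
void markDynamicDirty( RTacceleration dynamic_accel ){
  RT_CHECK_ERROR( rtAccelerationMarkDirty( dynamic_accel ) );
}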

I had a chance to do a rewrite of my code, implemented so that there is a single transform, geometry group, and acceleration, with a geometry and instance for each of my five different primitive types.

As expected, the host memory usage is considerably lower than before, only about 500MB for the 2 million primitive case. The device memory is also quite a bit lower than before, which is also great. The biggest gains have come in terms of acceleration build plus context compile time, which went from over a minute to maybe a second, and there is nearly an order of magnitude decrease in traversal time as well. My limitation now seems to be my own API using too much host memory, as the memory usage by OptiX on both the host and device is now relatively small.

A strange thing I now see is that the scaling to large scenes is somehow better than linear. It is almost perfectly linear up to about 5 million primitives, then the slope starts to noticeably decrease. Is that likely just because the acceleration algorithms eventually find an even more efficient way to process the geometry?

Thanks for your help!

I’d expect memory usage to be roughly linear in number of primitives; not sure why it falls off after 5M prims.

I’ll follow up internally with the OptiX devs regarding your original question about host memory for instances. I got similar numbers for a test scene using spheres as instances. The host memory looks mostly related to OptiX scene graph management, not so much user data like the 16 floats for a transform.

I’d recommend continuing with the more “mainstream” scene setup since it seems to be working out well for both memory and time. Let us know if you hit anything else.