memory usage in multi GPU system (NVLink) Linux

Hi,
I’ve been testing OptiX 5.1.1 for 3D visualization for a few days. My code is based on the optixMeshViewer example, where I have replaced the existing OBJ loader with a proprietary one. I use the example’s PTX files and provide all the buffers that the PTX code expects. My geometry is a classic triangle soup, so I provide the vertices and the indices; since the code computes a geometric normal on the fly when the normal buffer is empty, I don’t provide a normal buffer.
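The buffer setup looks roughly like this (a sketch using the OptiX 5 C++ wrapper; the variable names "vertex_buffer", "index_buffer" and "normal_buffer" follow the optixMeshViewer sample, and `hostVertices`/`hostIndices`/the counts are placeholders for my loader's data):

```cpp
// Sketch: feeding a triangle soup to optixMeshViewer-style PTX programs.
optix::Buffer vertices = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, numVertices);
memcpy(vertices->map(), hostVertices, numVertices * sizeof(optix::float3));
vertices->unmap();

optix::Buffer indices = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_INT3, numTriangles);
memcpy(indices->map(), hostIndices, numTriangles * sizeof(optix::int3));
indices->unmap();

// Empty normal buffer: the sample's programs then compute the
// geometric normal on the fly.
optix::Buffer normals = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_FLOAT3, 0);

geometry["vertex_buffer"]->setBuffer(vertices);
geometry["index_buffer"]->setBuffer(indices);
geometry["normal_buffer"]->setBuffer(normals);
geometry->setPrimitiveCount(numTriangles);
```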
I have access to a machine running Linux with 4 Tesla V100s, each with 32 GB of memory, connected through NVLink, and I tried to test the memory usage on that machine. Since it’s running Linux, the driver should automatically run in the equivalent of TCC mode.
If I create the buffers with the RT_BUFFER_INPUT flag, with a geometry of 145,323,936 triangles and “Bvh” acceleration structures, the memory usage is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   38C    P0    65W / 300W |  17366MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   40C    P0    68W / 300W |  14038MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   39C    P0    63W / 300W |  14038MiB / 32480MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0    69W / 300W |  14038MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

While if I use RT_BUFFER_INPUT_OUTPUT I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   37C    P0    73W / 300W |  14042MiB / 32480MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   38C    P0    76W / 300W |  10714MiB / 32480MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    70W / 300W |  10714MiB / 32480MiB |     66%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0    75W / 300W |  10714MiB / 32480MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+

If I use RT_BUFFER_INPUT_OUTPUT with only 1 GPU enabled, I get:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   32C    P0    43W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |  17370MiB / 32480MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    45W / 300W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Is this the expected behavior? Going from 1 GPU to 4 GPUs only frees about 3 GB of headroom per board. Am I missing something?

That is to be expected, as long as you’re not even close to the 32 GB per-board limit.

As long as there is no need for peer-to-peer access over the NVLink bridge, OptiX loads the geometry onto all boards for better multi-GPU rendering performance.
Once it hits the memory limit, it will migrate buffers to individual boards and use peer-to-peer access.

Though this won’t work if all your geometry is in one big buffer! You’d need to split it up into smaller chunks of a few million primitives each, so that they can be migrated individually.

Please see this thread as well, and especially the links in comments #2 and #4:
https://devtalk.nvidia.com/default/topic/1027203/?comment=5226059
(I would not recommend using the progressive API, though. Manual accumulation is faster.)

The difference between the INPUT and INPUT_OUTPUT buffers is that, in all shipping OptiX versions, INPUT_OUTPUT buffers in a multi-GPU context are not allocated on the devices but in pinned host memory. You should never need to declare geometry attribute buffers as INPUT_OUTPUT.
See more about that here: https://devtalk.nvidia.com/default/topic/1036340/?comment=5264830

Use the Trbvh acceleration structure builder, and if you’re only using triangles with float3 vertices and int3 indices, use the Acceleration properties to pick the specialized, faster builder implementation.
https://devtalk.nvidia.com/default/topic/1022634/?comment=5211794
Example code for an interleaved vertex attribute format is here:
https://github.com/nvpro-samples/optix_advanced_samples/blob/master/src/optixIntroduction/optixIntro_07/src/Application.cpp#L1613
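Picking the specialized triangle builder could look like this (a sketch; the buffer names passed to the properties must match whatever your node graph declares, and "vertex_buffer"/"index_buffer" below follow the SDK samples):

```cpp
// Sketch: select Trbvh and tell it where the float3 vertices and
// int3 indices live, enabling the specialized triangle builder.
optix::Acceleration accel = context->createAcceleration("Trbvh");
accel->setProperty("vertex_buffer_name", "vertex_buffer");
accel->setProperty("index_buffer_name", "index_buffer");
geometryGroup->setAcceleration(accel);
```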

Thank you for your reply. I will implement your recommended changes.

Hi,
I made some changes to my test code: I split the vertex buffer into multiple buffers (approximately 50 to 200, depending on the loaded mesh).
I also changed the CUDA code so that the vertex indices are now computed on the fly instead of being stored and copied from the CPU side, to save some memory on the GPU.
I ran some more tests, and it seems that the acceleration structures always occupy far more memory than the vertex buffers, so even if the vertex buffers are migrated across GPUs (in the NVLink case), they are not the bulkiest objects in GPU memory.
Here is some data I collected from a data set with 13,464,000 triangles (85 vertex buffers in this case):
With 4 GPUs and NoAccel the memory usage is 1013 MiB, 849 MiB, 849 MiB, 849 MiB.
With Bvh: 2125 MiB, 1815 MiB, 1815 MiB, 1815 MiB.
With Sbvh: 2725 MiB, 2415 MiB, 2415 MiB, 2415 MiB.
With Trbvh: 3667 MiB, 1815 MiB, 1815 MiB, 1815 MiB.

The bigger case with 145,323,936 triangles (about 200 vertex buffers here):
NoAccel: 2903 MiB, 2739 MiB, 2739 MiB, 2739 MiB.
Bvh: 16087 MiB, 12759 MiB, 12759 MiB, 12759 MiB.
In this case both Sbvh and Trbvh run out of memory.

We would like to be able to predict whether a triangle mesh will fit into GPU memory, given the acceleration structure used (Bvh, Sbvh or Trbvh). Is there a rule of thumb?
The second question is: are the acceleration structures also migrated across GPUs (with NVLink)?
Thank you

Maybe I wasn’t clear enough about partitioning the geometry into smaller blocks.
It’s not about the Geometry nodes alone; it’s about the Acceleration objects on the GeometryGroups above the GeometryInstances. If you’re using a single GeometryGroup in your scene graph for the 145 MTriangles case, using one or multiple Geometry nodes underneath doesn’t change the acceleration structure.
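In scene-graph terms, the partitioned layout could be sketched like this (`createChunkGeometryInstance` is a hypothetical helper that builds one Geometry/GeometryInstance per chunk of a few million triangles):

```cpp
// Sketch: one Acceleration per chunk under a top-level Group, instead of
// a single GeometryGroup (and single Acceleration) over all triangles.
optix::Group top = context->createGroup();
top->setAcceleration(context->createAcceleration("Trbvh"));
top->setChildCount(numChunks);
for (unsigned int i = 0; i < numChunks; ++i)
{
    optix::GeometryGroup gg = context->createGeometryGroup();
    gg->setChildCount(1);
    gg->setChild(0, createChunkGeometryInstance(i)); // hypothetical helper
    gg->setAcceleration(context->createAcceleration("Trbvh")); // per-chunk AS
    top->setChild(i, gg);
}
context["top_object"]->set(top);
```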

Assuming your out-of-memory errors are not on the host, you should be able to load 145 MTriangles onto a 32 GB board just fine, as the Bvh case shows.
However, the Trbvh builder in particular has a high temporary memory overhead during the build. See the Trbvh chunk_size acceleration property to work around that:
http://raytracing-docs.nvidia.com/optix/guide/index.html#host#acceleration-structure-properties
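Setting the property is a one-liner (a sketch; the 512 MB value is just an example, and the exact semantics of the value are described in the chapter linked above):

```cpp
// Sketch: cap Trbvh's temporary build memory by chunking the build.
// The property value is passed as a string, in bytes.
optix::Acceleration accel = context->createAcceleration("Trbvh");
accel->setProperty("chunk_size", "536870912"); // 512 MB of build scratch
```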

But that shouldn’t be a problem if each Acceleration contains only a few million triangles, which can then be accessed via peer-to-peer. And yes, the migration applies to the acceleration structures, attribute buffers and textures.

Thank you for the clarification. I was able to visualize the 145 MTriangles case using Trbvh; I was indeed using only one GeometryGroup.