Omniverse Flow for Point Clouds on Multi-GPU Servers vs Single-GPU Workstations - Optimization & Performance

Hello,

I have a concern about the optimization of the Flow plugin, specifically for point clouds. I’m noticing some worrying patterns when trying to handle very large-scale data sets: they don’t seem to be scaling well onto multi-GPU hardware.

Here’s my experiment from a workstation with a single NVIDIA A6000.

Inputs (USD File is 20,755,622 KB):

  • Point cloud with ~885,000,000 points
  • Max Blocks set to: 650,000 (tested all the way up to 950,000, and it seemed to scale with system RAM with no issues)
  • Cell size set to: 0.59455 (auto cell size was enabled to find this number, then disabled)

Results (update_while_paused unchecked):

  • RAM used by Kit at steady state in RTX Real-Time mode: 15,500 MB
  • Dedicated VRAM Usage: 41.2/47.5 GB
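
As a rough sanity check on the numbers above (my own back-of-the-envelope arithmetic, assuming the file size reported in KB is actually KiB):

```python
# Back-of-the-envelope check that the file size and point count are
# consistent with each other. Assumes the 20,755,622 KB above is KiB.
file_bytes = 20_755_622 * 1024      # ~21.25 GB on disk
num_points = 885_000_000

bytes_per_point = file_bytes / num_points
print(f"~{bytes_per_point:.1f} bytes per point")  # ~24.0 bytes/point

# ~24 bytes/point is plausible for 3 x float32 positions (12 bytes)
# plus per-point color/normal data, so the VRAM figures above are in
# a believable range for a data set of this size.
```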

And here’s the same experiment from our server with eight RTX 8000 GPUs.

Inputs (USD File is 20,755,622 KB):

  • Point cloud with ~885,000,000 points
  • Max Blocks set to: 650,000 (going past 700,000 caused the renderer to become unresponsive)
  • Cell size set to: 0.59457 (auto cell size was enabled to find this number, then disabled; ever so slightly different from the other machine, but still within a very small margin of error)

Results (update_while_paused unchecked):

  • RAM used by Kit at steady state in RTX Real-Time mode: 15,473 MB
  • Dedicated VRAM Usage (GPU0): 40.5/48 GB
  • Dedicated VRAM Usage (GPU1): 36.9/48 GB
  • Dedicated VRAM Usage (GPU2): 36.9/48 GB
  • Dedicated VRAM Usage (GPU3): 36.9/48 GB
  • Dedicated VRAM Usage (GPU4): 36.9/48 GB
  • Dedicated VRAM Usage (GPU5): 36.9/48 GB
  • Dedicated VRAM Usage (GPU6): 36.9/48 GB
  • Dedicated VRAM Usage (GPU7): 37.1/48 GB

Why are so many of the GPUs using such a significant amount of VRAM?

Can anyone help me understand what’s going on here, and how I might be able to turn the ‘max blocks’ setting in Flow up further by taking advantage of multiple GPUs? The upper bound seems to be the same (actually greater, according to my testing) for one GPU (admittedly a higher-tier one) as it is for eight, which is puzzling and frustrating. I was able to keep cranking max blocks up on the single A6000 well beyond what the multi-GPU system could handle; past its limit, the server would just show a black screen in the viewport, with no points rendered at all.
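
For anyone who wants to poke at the same setting programmatically, here’s a minimal sketch using Kit’s Script Editor. Note that the /rtx/flow/maxBlocks path is my assumption based on the Render Settings UI and may be named differently in your build:

```python
# Minimal sketch: reading/adjusting the Flow max blocks setting via
# carb.settings from Kit's Script Editor. The "/rtx/flow/maxBlocks"
# path is an assumption based on the Render Settings UI and may
# differ between Create builds.
import carb.settings

settings = carb.settings.get_settings()

current = settings.get("/rtx/flow/maxBlocks")
print(f"Current Flow max blocks: {current}")

# Raise it cautiously; on my 8x RTX 8000 server, anything past
# ~700,000 made the renderer unresponsive.
settings.set("/rtx/flow/maxBlocks", 650_000)
```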

I can’t tell if the issue is on my end or not. Create is my favorite point cloud renderer; if there’s anything I can do or provide that could help improve it, please let me know.

Thank you for reading!

-Matthew

Hello @LMTraina99! I informed the development team of your findings and created an internal development ticket. I will also try to grab a developer to address your concerns here!

(An internal development ticket was created for this post: OM-47645: Omniverse Flow for Point Clouds on Multi-GPU Servers vs Single-GPU Workstations - Optimization & Performance)

Just wanted to let you know that the development team is digging into this issue. Right now, the Flow code is distributing the memory usage across GPUs for point clouds, and the team is looking into how they can improve this.

Glad to hear it! I hope to see improvements in the future.

I think I’ve seen that the newest Create version has added support for rendering UsdGeomPoints directly. I’ve yet to investigate how this performs (or whether it still uses Flow as a go-between). I’m thinking this could be more easily optimized (or already is) because it’s a native USD data structure, though I assume it still uses the same voxel-based conversion for rendering.
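
For anyone who wants to try the same comparison, authoring a small UsdGeomPoints test prim is straightforward with the standard pxr Python API. A minimal, self-contained example (the prim path and point count here are arbitrary):

```python
# Minimal example of authoring a UsdGeomPoints prim with the standard
# pxr USD Python API, e.g. to benchmark the direct points rendering.
from pxr import Usd, UsdGeom, Gf
import random

stage = Usd.Stage.CreateNew("points_test.usda")
points_prim = UsdGeom.Points.Define(stage, "/World/TestPoints")

n = 100_000  # small test set; the real data set is ~885M points
pts = [Gf.Vec3f(random.uniform(-100, 100),
                random.uniform(-100, 100),
                random.uniform(-100, 100)) for _ in range(n)]

points_prim.CreatePointsAttr(pts)
points_prim.CreateWidthsAttr([0.5] * n)  # per-point width (world units)

stage.GetRootLayer().Save()
```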

I want to say again that my criticism comes with great gratitude to the developers here: Create is a fantastic point cloud renderer compared to other openly accessible options, and its performance is impressive for the graphical fidelity, even on only one GPU.

Cheers!

-Matthew