Estimating bandwidth and throughput for HPC Edge applications

ben.stensland · February 20, 2024, 4:16pm

I’m trying to estimate how many NICs/GPUs would be required for a particular data ingest, transform, and distribution application.

It’s a similar application to the HoloHub example (NVIDIA Holohub, nvidia-holoscan.github.io), and I’d like to get a better idea of scalability and bandwidth limitations. What would be the expected maximum throughput for a ConnectX-7 NIC and an A30 GPU? Would they achieve the 400 Gbps limit of the ConnectX-7?

adamt · February 20, 2024, 5:24pm

Hi @ben.stensland – this is a great question, and thanks for posting on the forums.

The Network Radar Pipeline that you referenced uses the Advanced Networking Operator (ANO) that’s available in Holohub. The goal of this operator is to provide both high-bandwidth and low-latency packet transfers from the NIC to the GPU, abstracting away much of the implementation detail.

We have worked with a number of customers who are able to combine I/O with compute at 200Gbps on an IGX system (CX7 and A6000 dGPU). We have also shown that 400Gbps to GPU is possible with the ANO, but doing this requires additional optimizations.

Without knowing more about your application and specific requirements (e.g. latency, ability to split data feeds), I think you could safely plan for 200 Gbps per GPU for a low-latency, GPU-accelerated real-time pipeline.

Of course, we’re always here to talk through specifics about your own application.

ben.stensland · February 20, 2024, 6:45pm

Thanks for the quick response! In our scenario, we have several hundred incoming data streams, each carrying about 1.6 Gbps of I/Q data, which is decimated and processed on a per-stream basis (i.e. embarrassingly parallel). These reduced streams (~0.01 Gbps each) are then sent back out to another data sink on the network. From ingest to client data sink, latency should be <0.1 s.

Would you expect ~126 streams per CX7/A6000 dGPU pair? How intensive are the optimizations to achieve the 400 Gbps rate?
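
For readers following the sizing question, here is a minimal back-of-envelope sketch of the stream budget in Python. It uses only the figures quoted in this thread (1.6 Gbps in and ~0.01 Gbps out per stream, 200 Gbps planning rate per NIC/GPU pair). It counts ingest payload only and ignores packet headers, compute load, and latency, so treat the result as an upper bound rather than a measured figure. The function names and the 300-stream total are hypothetical, chosen only for illustration. Ingest alone gives 125 streams per pair, close to the ~126 estimate above.

```python
# Back-of-envelope stream budget. Rates are in Mbps so the math stays in integers.
STREAM_IN_MBPS = 1_600        # ~1.6 Gbps of I/Q per incoming stream (from the thread)
STREAM_OUT_MBPS = 10          # ~0.01 Gbps per reduced stream (from the thread)
PLANNING_RATE_MBPS = 200_000  # 200 Gbps planning figure per NIC/GPU pair

def streams_per_pair(link_mbps, stream_mbps=STREAM_IN_MBPS):
    """Whole streams that fit in the ingest budget (headers and compute ignored)."""
    return link_mbps // stream_mbps

def pairs_needed(total_streams, per_pair):
    """Ceiling division: NIC/GPU pairs required to carry total_streams."""
    return -(-total_streams // per_pair)

def egress_mbps(n_streams, stream_out_mbps=STREAM_OUT_MBPS):
    """Aggregate outbound rate of the reduced streams from one pair."""
    return n_streams * stream_out_mbps

if __name__ == "__main__":
    per_pair = streams_per_pair(PLANNING_RATE_MBPS)
    total = 300  # hypothetical stand-in for "several hundred" streams
    print("streams per pair:", per_pair)
    print("pairs for", total, "streams:", pairs_needed(total, per_pair))
    print("egress per pair (Mbps):", egress_mbps(per_pair))
```

At these rates the outbound traffic is under 1% of the inbound traffic, so the ingest direction is what drives the NIC count.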

adamt · February 21, 2024, 3:48pm

Hey @ben.stensland – Looking closer at the A30 GPU, you’re limited to 16 lanes of PCIe Gen4 (64 GB/s bi-directional), meaning peak bandwidth in one direction would be 32 GB/s × 8 b/B = 256 Gbps. That said, we’d love to chat with you more in-depth about your application and system design. I’ll send a note to you with my contact information, and we can take the conversation offline.
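
The PCIe arithmetic above can be written out as a short Python sketch. It takes the nominal figures from this post (64 GB/s bi-directional for a Gen4 x16 link) and the 400 Gbps ConnectX-7 line rate from the original question, and assumes the slower of the two links bounds NIC-to-GPU ingest. Encoding and protocol overhead are not modeled, so real throughput would be lower. The function names are hypothetical.

```python
# Nominal link arithmetic from the thread; protocol overhead is NOT modeled.
PCIE_GEN4_X16_BIDIR_GBYTES = 64  # GB/s, both directions combined (from the post)
NIC_LINE_RATE_GBPS = 400         # ConnectX-7 line rate (from the question)
STREAM_IN_MBPS = 1_600           # ~1.6 Gbps per incoming stream (from the thread)

def one_direction_gbps(bidir_gbytes_per_s):
    """Halve the bi-directional figure, then convert bytes to bits."""
    return bidir_gbytes_per_s / 2 * 8

def ingest_ceiling_gbps(nic_gbps, pcie_gbps):
    """Assumption: the slower of NIC and PCIe bounds NIC-to-GPU ingest."""
    return min(nic_gbps, pcie_gbps)

def max_streams(ceiling_gbps, stream_mbps=STREAM_IN_MBPS):
    """Whole streams that fit under the ceiling, payload only."""
    return int(ceiling_gbps * 1000) // stream_mbps

if __name__ == "__main__":
    pcie = one_direction_gbps(PCIE_GEN4_X16_BIDIR_GBYTES)
    ceiling = ingest_ceiling_gbps(NIC_LINE_RATE_GBPS, pcie)
    print("PCIe one-direction Gbps:", pcie)
    print("ingest ceiling Gbps:", ceiling)
    print("theoretical max streams:", max_streams(ceiling))
```

Under these nominal numbers the PCIe link, not the NIC, is the limiting factor for a single A30, which is why 400 Gbps into one such GPU is out of reach.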

ben.stensland · February 28, 2024, 1:59pm

Hey @adamt, I’ve tried sending you a couple of emails to set up a call. Are they maybe getting stuck in a filter? The originating email would start with ben.stensland@ . Thanks

adamt · February 28, 2024, 4:11pm

Oh no! Following up in DM, @ben.stensland
