SSM inference on BlueField-3

Hey everyone,

I’m building an out-of-band network anomaly detection system that runs a Selective State Space Model (Mamba-3) on live packet flows. Flow-level features only — no payload inspection, zero decryption overhead.

Currently running a split-brain pipeline on an RTX 4090:

  • C++ eBPF Harvester intercepts at the NIC via XDP, extracts feature vectors, and writes to /dev/shm via DMA

  • PyTorch inference maps shared memory via torch.frombuffer() — true zero-copy IPC, bypassing the Python GIL

  • Pre-compiled CUDA Graph fires on a semaphore flip
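To make the shared-memory handoff in the list above concrete, here is a minimal sketch that uses only the Python standard library. It is not the actual harvester or inference code: `N_FEATURES`, `BATCH`, `write_row`, `read_row`, and `demo` are hypothetical names, and both the producer and consumer roles run in one process purely for illustration.

```python
import struct
from multiprocessing import shared_memory

N_FEATURES = 8   # hypothetical feature-vector width (float32 values per flow)
BATCH = 4        # hypothetical batch size, kept tiny for the demo
ROW_FMT = f"{N_FEATURES}f"
ROW_BYTES = struct.calcsize(ROW_FMT)


def write_row(buf, index, row):
    """Producer side: pack one float32 feature vector into slot `index`."""
    struct.pack_into(ROW_FMT, buf, index * ROW_BYTES, *row)


def read_row(buf, index):
    """Consumer side: decode slot `index` from the shared buffer."""
    return struct.unpack_from(ROW_FMT, buf, index * ROW_BYTES)


def demo():
    # Harvester role: create the segment (backed by /dev/shm on Linux).
    producer = shared_memory.SharedMemory(create=True, size=BATCH * ROW_BYTES)
    # Inference role: attach to the same segment by name.
    consumer = shared_memory.SharedMemory(name=producer.name)
    try:
        write_row(producer.buf, 0, [1.0] * N_FEATURES)
        first = read_row(consumer.buf, 0)
        # Overwrite in place; the consumer sees the change without re-attaching.
        write_row(producer.buf, 0, [2.0] * N_FEATURES)
        second = read_row(consumer.buf, 0)
        # With PyTorch installed, the same buffer could be wrapped as a tensor:
        #   torch.frombuffer(consumer.buf, dtype=torch.float32,
        #                    count=BATCH * N_FEATURES)
        return first, second
    finally:
        consumer.close()
        producer.close()
        producer.unlink()


if __name__ == "__main__":
    print(demo())
```

Both handles map the same pages, so the second read reflects the in-place overwrite. A real pipeline would still need the semaphore (or another synchronization primitive) to keep the consumer from reading a half-written batch.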

We’re hitting a verified 40 Mpps compute ceiling at 1.6 ms per 64K-packet batch. At batch size 256 we hit a hard PCIe Gen4 bottleneck transferring 5.86 MB of shared memory: the GPU inference itself isn’t the limit, the bus is.
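As a back-of-envelope check on those figures, the sketch below recomputes the packet rate from the batch size and latency, and estimates a best-case copy time for the 5.86 MB buffer. Two assumptions are mine: "64K" is read as 65,536 packets, and the link is taken at the theoretical PCIe Gen4 x16 peak, which real transfers do not reach.

```python
def packets_per_second(batch_packets: int, batch_seconds: float) -> float:
    """Sustained packet rate implied by one batch's size and latency."""
    return batch_packets / batch_seconds


def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Best-case time to move num_bytes over a link of the given rate."""
    return num_bytes / bytes_per_second


# Figures quoted in the post.
BATCH_PACKETS = 64 * 1024   # "64K" read as 65,536 packets (assumption)
BATCH_SECONDS = 1.6e-3      # 1.6 ms per batch
PAYLOAD_BYTES = 5.86e6      # "5.86MB" read as decimal megabytes (assumption)

# PCIe Gen4 x16 theoretical peak per direction:
# 16 GT/s per lane * 16 lanes * 128b/130b encoding / 8 bits per byte.
PCIE_GEN4_X16_BPS = 16e9 * 16 * (128 / 130) / 8

rate = packets_per_second(BATCH_PACKETS, BATCH_SECONDS)
ns_per_packet = 1e9 / rate
best_case_copy = transfer_seconds(PAYLOAD_BYTES, PCIE_GEN4_X16_BPS)

print(f"rate:            {rate / 1e6:.2f} Mpps")
print(f"per-packet time: {ns_per_packet:.1f} ns")
print(f"best-case copy:  {best_case_copy * 1e3:.3f} ms")
```

The rate works out to about 40.96 Mpps, or roughly 24.4 ns per packet, which matches the quoted 40 Mpps. The copy estimate ignores protocol overhead, pinning, and any staging copies, so treat it as a lower bound when comparing against measured transfer times.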

We’re currently going through the NVIDIA Inception application process (domain vixdev.cloud) and evaluating BlueField-3 as our next hardware target — trying to get the architecture decisions right before we have access to the actual silicon.

Two paths we’re considering:

  1. On-DPU inference — run the SSM core directly on BF3’s ARM cores, eliminating the host PCIe transfer entirely. Concern: ARM throughput vs CUDA Graph on host GPU for a continuous-time sequence model.

  2. Host GPU offload via GPUDirect RDMA: BlueField handles XDP ingestion and feature extraction, and tensors are pushed directly to the host GPU via ConnectX-7, bypassing system RAM. Projected ceiling: 128 Mpps.
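To put the projected 128 Mpps in terms of per-batch budgets, here is a small sizing sketch. The 128 Mpps target and the 64K batch come from the post. `BYTES_PER_VECTOR` and the one-vector-per-packet model are hypothetical placeholders, not measured values.

```python
def batch_budget_seconds(batch_packets: int, target_pps: float) -> float:
    """Time available per batch if the pipeline must sustain target_pps."""
    return batch_packets / target_pps


def required_bytes_per_second(target_pps: float, bytes_per_vector: int) -> float:
    """Feature-tensor bandwidth needed, assuming one vector per packet."""
    return target_pps * bytes_per_vector


TARGET_PPS = 128e6          # projected ceiling quoted in the post
BATCH_PACKETS = 64 * 1024   # "64K" read as 65,536 packets (assumption)
BYTES_PER_VECTOR = 32       # hypothetical: 8 float32 features per packet

budget = batch_budget_seconds(BATCH_PACKETS, TARGET_PPS)
bandwidth = required_bytes_per_second(TARGET_PPS, BYTES_PER_VECTOR)

print(f"per-batch budget:  {budget * 1e3:.3f} ms")
print(f"feature bandwidth: {bandwidth / 1e9:.3f} GB/s")
```

At this target, a 64K-packet batch has to complete in about 0.512 ms, compared with the 1.6 ms measured today, so inference latency and the transfer path would both need to shrink. Swap in the real feature-vector size to get a meaningful bandwidth figure.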

For a latency-sensitive SSM inference workload specifically, has anyone evaluated whether BF3’s ARM cores are viable for on-DPU inference, or does host GPU offload via GPUDirect remain the only practical path at line rate?

Happy to share more on the architecture.

Thanks,

Vickson | Founder, Vixero Technology Enterprise

vixdev.cloud

1/ Option 1 – SSM on BlueField-3 Arm cores

Viable only for relatively small / lightweight models or simple pre-filters. It’s not a realistic primary inference path at the rates you’re targeting.

2/ Option 2 – Host GPU offload via GPUDirect RDMA

This is the practical and recommended direction: use BlueField-3 for packet ingest + feature extraction, and use the host GPU for the SSM inference.

For detailed sizing and architecture guidance, please engage NVIDIA through your sales or enterprise support channel.

Appreciate the clarity, jsl2! Going with Option 2 then – BF3 handles ingestion, with GPUDirect RDMA to the host GPU for the SSM. Will reach out through Inception for the sizing details. Thanks!