SSM inference on BlueField-3

Hey everyone,

I’m building an out-of-band network anomaly detection system that runs a Selective State Space Model (Mamba-3) on live packet flows. Flow-level features only — no payload inspection, zero decryption overhead.

Currently running a split-brain pipeline on an RTX 4090:

  • A C++ harvester with an eBPF/XDP program intercepts packets at the NIC, extracts per-flow feature vectors, and writes them into a ring buffer in /dev/shm

  • The PyTorch inference process maps the same shared memory via torch.frombuffer(): true zero-copy IPC, no per-batch serialization or copies

  • A pre-captured CUDA Graph replays on a semaphore flip, avoiding per-batch kernel-launch overhead
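The shared-memory handoff above can be sketched with stdlib shared memory, with numpy standing in for torch (np.frombuffer and torch.frombuffer both wrap an existing buffer without copying). The batch and feature dimensions here are illustrative assumptions, not the production layout:

```python
# Sketch of the zero-copy shared-memory handoff: one buffer, two views,
# no copies. numpy stands in for torch; BATCH/FEAT_DIM are assumptions.
import numpy as np
from multiprocessing import shared_memory

BATCH, FEAT_DIM = 64 * 1024, 24   # assumed: 24 float32 features/packet

# Producer side (stands in for the C++ harvester): allocate the segment.
shm = shared_memory.SharedMemory(create=True, size=BATCH * FEAT_DIM * 4)
producer = np.frombuffer(shm.buf, dtype=np.float32).reshape(BATCH, FEAT_DIM)

# Consumer side; torch equivalent:
#   torch.frombuffer(shm.buf, dtype=torch.float32)
consumer = np.frombuffer(shm.buf, dtype=np.float32).reshape(BATCH, FEAT_DIM)

producer[0, 0] = 1.5              # harvester writes a feature...
assert consumer[0, 0] == 1.5      # ...inference side sees it, zero copies

del producer, consumer            # release views before unmapping
shm.close()
shm.unlink()
```

In the real pipeline the two sides are separate processes attaching to the segment by name, with the semaphore from the third bullet signalling batch readiness.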

Hitting a verified 40 Mpps compute ceiling at 1.6 ms per 64K-packet batch. At batch size 256 we hit a hard PCIe Gen4 bottleneck transferring 5.86 MB of shared memory: the GPU inference itself isn't the limit, the bus is.
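A back-of-envelope on those numbers, for context. The ~25 GB/s effective Gen4 x16 bandwidth is an assumption (theoretical is ~32 GB/s); the other figures are from the measurements above:

```python
# Back-of-envelope check of the measured ceiling and the bus transfer.
# PCIE_EFF_BPS is an assumed effective Gen4 x16 figure, not measured.
BATCH_PKTS   = 64 * 1024     # packets per batch
BATCH_TIME_S = 1.6e-3        # measured compute time per batch
XFER_BYTES   = 5.86e6        # shared-memory transfer per batch
PCIE_EFF_BPS = 25e9          # assumed effective Gen4 x16 bandwidth

mpps = BATCH_PKTS / BATCH_TIME_S / 1e6          # ~41.0 Mpps
xfer_us = XFER_BYTES / PCIE_EFF_BPS * 1e6       # ~234 us if bandwidth-bound
bytes_per_pkt = XFER_BYTES / BATCH_PKTS         # ~89 B of features/packet

print(f"compute ceiling : {mpps:.1f} Mpps")
print(f"bus transfer    : {xfer_us:.0f} us/batch")
print(f"feature payload : {bytes_per_pkt:.0f} B/packet")
```

If the transfer were purely bandwidth-bound it would cost only ~0.2 ms per batch, so at small batch sizes the stall is plausibly dominated by per-transfer latency and synchronization rather than raw Gen4 bandwidth, which is part of what makes a direct-to-GPU path attractive.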

We’re currently going through the NVIDIA Inception application process (domain vixdev.cloud) and evaluating BlueField-3 as our next hardware target — trying to get the architecture decisions right before we have access to the actual silicon.

Two paths we’re considering:

  1. On-DPU inference: run the SSM core directly on BF3's ARM cores, eliminating the host PCIe transfer entirely. Concern: ARM throughput versus a host-GPU CUDA Graph for a continuous-time sequence model.

  2. Host GPU offload via GPUDirect RDMA: BlueField handles XDP ingestion and feature extraction; feature tensors are pushed directly into host GPU memory through the ConnectX-7, bypassing system RAM. Projected ceiling: 128 Mpps.
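To put rough numbers on option 1: per packet (or per flow event), the model has to advance one selective-SSM recurrence. A minimal numpy sketch of the diagonal-A step used in Mamba-style models, with assumed (not production) dimensions:

```python
# One selective-SSM recurrence step (Mamba-style, diagonal A), to size
# the per-step work the BF3 ARM cores would run. D_MODEL and D_STATE
# are illustrative assumptions, not our production shapes.
import numpy as np

D_MODEL, D_STATE = 256, 16
rng = np.random.default_rng(0)

A = -np.exp(rng.standard_normal((D_MODEL, D_STATE)))  # stable decay rates

def ssm_step(h, x, dt, B, C):
    """h' = exp(dt*A) * h + (dt*B) * x ;  y = sum(h' * C) per channel."""
    dA = np.exp(dt[:, None] * A)                        # (D_MODEL, D_STATE)
    h = dA * h + (dt[:, None] * B[None, :]) * x[:, None]
    y = (h * C[None, :]).sum(axis=-1)                   # (D_MODEL,)
    return h, y

h  = np.zeros((D_MODEL, D_STATE))
x  = rng.standard_normal(D_MODEL)
dt = np.abs(rng.standard_normal(D_MODEL)) * 0.1         # input-dependent step
B  = rng.standard_normal(D_STATE)
C  = rng.standard_normal(D_STATE)

h, y = ssm_step(h, x, dt, B, C)
flops_per_step = 6 * D_MODEL * D_STATE   # rough multiply-add count
print(y.shape, flops_per_step)           # (256,) 24576
```

At 40 Mpps with one step per packet, that is on the order of 40e6 × 2.5e4 ≈ 1 TFLOP/s at these assumed shapes, before projections and activations; that is the figure to weigh against the sustained NEON throughput of BF3's 16 Cortex-A78 cores versus a host-GPU CUDA Graph.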

For a latency-sensitive SSM inference workload specifically, has anyone evaluated whether BF3’s ARM cores are viable for on-DPU inference, or does host GPU offload via GPUDirect remain the only practical path at line rate?

Happy to share more on the architecture.

Thanks,

Vickson | Founder, Vixero Technology Enterprise

vixdev.cloud