Hey everyone,
I’m building an out-of-band network anomaly detection system that runs a Selective State Space Model (Mamba-3) on live packet flows. Flow-level features only — no payload inspection, zero decryption overhead.
Currently running a split-brain pipeline on an RTX 4090:
- C++ eBPF harvester intercepts at the NIC via XDP, extracts feature vectors, and writes to `/dev/shm` via DMA
- PyTorch inference maps the shared memory via `torch.frombuffer()` for true zero-copy IPC, bypassing the Python GIL
- A pre-compiled CUDA Graph fires on a semaphore flip
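For readers unfamiliar with the consumer side of that handoff, here is a minimal stdlib sketch of mapping a shared-memory region without copying. The path, feature dimension, and layout below are hypothetical placeholders, not our real schema, and a temp file stands in for the DMA-filled `/dev/shm` region:

```python
import mmap
import struct

# Hypothetical layout: FEATURE_DIM float32 features per packet, BATCH packets.
# These names and sizes are illustration only, not the pipeline's real schema.
FEATURE_DIM = 16
BATCH = 4

# Stand-in for the harvester: write one batch of feature vectors to a file.
# In the real pipeline this region lives under /dev/shm and is DMA-filled.
PATH = "/tmp/flow_features.bin"
with open(PATH, "wb") as f:
    f.write(struct.pack(f"{BATCH * FEATURE_DIM}f",
                        *range(BATCH * FEATURE_DIM)))

# Consumer side: map the region read-only; the pages are shared, not copied.
with open(PATH, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(buf)
# The inference process would then wrap the same memory with zero copies:
#   t = torch.frombuffer(view, dtype=torch.float32).reshape(BATCH, FEATURE_DIM)
first_vec = struct.unpack_from(f"{FEATURE_DIM}f", view, 0)
print(first_vec[:4])  # -> (0.0, 1.0, 2.0, 3.0)
```

One caveat worth noting: `torch.frombuffer()` produces a CPU tensor aliasing the buffer, so the host-to-device copy still happens at `.cuda()` time, which is exactly where the PCIe cost discussed below shows up.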
Hitting a verified 40 Mpps compute ceiling at 1.6 ms per 64K-packet batch. At batch 256 we hit a hard PCIe Gen4 bottleneck transferring 5.86 MB of shared memory: the GPU inference itself isn't the limit, the bus is.
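A quick back-of-envelope on those figures (using only the numbers quoted above): 64K packets in 1.6 ms is ~41 Mpps, and 5.86 MB per 1.6 ms batch is ~3.6 GB/s sustained, which is well under Gen4 x16's raw bandwidth, so the pain is plausibly per-transfer latency and launch overhead rather than bus bandwidth; worth confirming with a profiler.

```python
# Back-of-envelope check on the post's own figures; no new measurements.
batch_packets = 64 * 1024      # 64K packets per batch
batch_time_s = 1.6e-3          # 1.6 ms per batch
mpps = batch_packets / batch_time_s / 1e6
print(f"{mpps:.1f} Mpps")      # -> 41.0 Mpps, matching the ~40 Mpps ceiling

transfer_bytes = 5.86 * 2**20  # 5.86 MB moved over PCIe per batch
gb_per_s = transfer_bytes / 2**30 / batch_time_s
print(f"{gb_per_s:.2f} GB/s")  # -> 3.58 GB/s sustained host-to-GPU demand
```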
We’re currently going through the NVIDIA Inception application process (domain vixdev.cloud) and evaluating BlueField-3 as our next hardware target — trying to get the architecture decisions right before we have access to the actual silicon.
Two paths we're considering:
- On-DPU inference: run the SSM core directly on BF3's ARM cores, eliminating the host PCIe transfer entirely. Concern: ARM throughput vs. a CUDA Graph on the host GPU for a continuous-time sequence model.
- Host GPU offload via GPUDirect RDMA: BlueField handles XDP ingestion and feature extraction, and tensors are pushed directly to the host GPU over ConnectX-7, bypassing system RAM. Projected ceiling: 128 Mpps.
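One way we plan to de-risk option 1 before silicon arrives is a per-core timing stand-in: the ZOH-discretized SSM recurrence (h &lt;- exp(dt*A)*h + dt*B*x; y = C.h) is the inner loop that BF3's ARM cores would have to sustain. The sketch below is a deliberately naive scalar version in pure Python (state size, coefficients, and step count are all made up), useful only for relative timing across cores, not as Mamba-3's actual kernel:

```python
import math
import time

def ssm_step(h, x, dt, A, B, C):
    """One discretized selective-SSM step, scalar per state channel:
    h[i] <- exp(dt*A[i]) * h[i] + dt * B[i] * x ; y = sum(C[i] * h[i])."""
    y = 0.0
    for i in range(len(h)):
        h[i] = math.exp(dt * A[i]) * h[i] + dt * B[i] * x
        y += C[i] * h[i]
    return y

STATE = 16                                  # toy state size, illustration only
A = [-1.0 - 0.1 * i for i in range(STATE)]  # stable (negative) poles
B = [1.0] * STATE
C = [1.0 / STATE] * STATE
h = [0.0] * STATE

t0 = time.perf_counter()
for _ in range(10_000):                     # stream of 10k "packets"
    y = ssm_step(h, 1.0, 0.01, A, B, C)
elapsed = time.perf_counter() - t0
print(f"{10_000 / elapsed / 1e6:.3f} M steps/s on this core")
```

Running the same loop (or a NEON-vectorized C equivalent) on BF3's Cortex-A78 cores vs. the host would give a first-order answer to the ARM-throughput concern before committing to either path.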
For a latency-sensitive SSM inference workload specifically, has anyone evaluated whether BF3’s ARM cores are viable for on-DPU inference, or does host GPU offload via GPUDirect remain the only practical path at line rate?
Happy to share more on the architecture.
Thanks,
Vickson | Founder, Vixero Technology Enterprise
vixdev.cloud