Hey everyone,
I’m building an out-of-band network anomaly detection system that runs a Selective State Space Model (Mamba-3) on live packet flows. Flow-level features only — no payload inspection, zero decryption overhead.
Currently running a split-brain pipeline on an RTX 4090:
- C++ eBPF harvester intercepts at the NIC via XDP, extracts feature vectors, and writes to `/dev/shm` via DMA
- PyTorch inference maps the shared memory via `torch.frombuffer()` for true zero-copy IPC, bypassing the Python GIL
- A pre-compiled CUDA Graph fires on a semaphore flip
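For readers unfamiliar with the consumer side of that handoff, here is a minimal stdlib sketch of mapping a shared-memory region without copying. The path, feature dimension, and layout below are hypothetical placeholders, not our real schema, and a temp file stands in for the DMA-filled `/dev/shm` region:

```python
import mmap
import struct

# Hypothetical layout: FEATURE_DIM float32 features per packet, BATCH packets.
# These names and sizes are illustration only, not the pipeline's real schema.
FEATURE_DIM = 16
BATCH = 4

# Stand-in for the harvester: write one batch of feature vectors to a file.
# In the real pipeline this region lives under /dev/shm and is DMA-filled.
PATH = "/tmp/flow_features.bin"
with open(PATH, "wb") as f:
    f.write(struct.pack(f"{BATCH * FEATURE_DIM}f",
                        *range(BATCH * FEATURE_DIM)))

# Consumer side: map the region read-only; the pages are shared, not copied.
with open(PATH, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(buf)
# The inference process would then wrap the same memory with zero copies:
#   t = torch.frombuffer(view, dtype=torch.float32).reshape(BATCH, FEATURE_DIM)
first_vec = struct.unpack_from(f"{FEATURE_DIM}f", view, 0)
print(first_vec[:4])  # -> (0.0, 1.0, 2.0, 3.0)
```

One caveat worth noting: `torch.frombuffer()` produces a CPU tensor aliasing the buffer, so the host-to-device copy still happens at `.cuda()` time, which is exactly where the PCIe cost discussed below shows up.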
Hitting a verified 40 Mpps compute ceiling at 1.6 ms per 64K-packet batch. At batch 256 we hit a hard PCIe Gen4 bottleneck transferring 5.86 MB of shared memory: the GPU inference itself isn't the limit, the bus is.
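A quick back-of-envelope on those figures (using only the numbers quoted above): 64K packets in 1.6 ms is ~41 Mpps, and 5.86 MB per 1.6 ms batch is ~3.6 GB/s sustained, which is well under Gen4 x16's raw bandwidth, so the pain is plausibly per-transfer latency and launch overhead rather than bus bandwidth; worth confirming with a profiler.

```python
# Back-of-envelope check on the post's own figures; no new measurements.
batch_packets = 64 * 1024      # 64K packets per batch
batch_time_s = 1.6e-3          # 1.6 ms per batch
mpps = batch_packets / batch_time_s / 1e6
print(f"{mpps:.1f} Mpps")      # -> 41.0 Mpps, matching the ~40 Mpps ceiling

transfer_bytes = 5.86 * 2**20  # 5.86 MB moved over PCIe per batch
gb_per_s = transfer_bytes / 2**30 / batch_time_s
print(f"{gb_per_s:.2f} GB/s")  # -> 3.58 GB/s sustained host-to-GPU demand
```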
We’re currently going through the NVIDIA Inception application process (domain vixdev.cloud) and evaluating BlueField-3 as our next hardware target — trying to get the architecture decisions right before we have access to the actual silicon.
Two paths we're considering:
- On-DPU inference: run the SSM core directly on BF3's ARM cores, eliminating the host PCIe transfer entirely. Concern: ARM throughput vs. a CUDA Graph on the host GPU for a continuous-time sequence model.
- Host GPU offload via GPUDirect RDMA: BlueField handles XDP ingestion and feature extraction, and tensors are pushed directly to the host GPU over ConnectX-7, bypassing system RAM. Projected ceiling: 128 Mpps.
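One way we plan to de-risk option 1 before silicon arrives is a per-core timing stand-in: the ZOH-discretized SSM recurrence (h &lt;- exp(dt*A)*h + dt*B*x; y = C.h) is the inner loop that BF3's ARM cores would have to sustain. The sketch below is a deliberately naive scalar version in pure Python (state size, coefficients, and step count are all made up), useful only for relative timing across cores, not as Mamba-3's actual kernel:

```python
import math
import time

def ssm_step(h, x, dt, A, B, C):
    """One discretized selective-SSM step, scalar per state channel:
    h[i] <- exp(dt*A[i]) * h[i] + dt * B[i] * x ; y = sum(C[i] * h[i])."""
    y = 0.0
    for i in range(len(h)):
        h[i] = math.exp(dt * A[i]) * h[i] + dt * B[i] * x
        y += C[i] * h[i]
    return y

STATE = 16                                  # toy state size, illustration only
A = [-1.0 - 0.1 * i for i in range(STATE)]  # stable (negative) poles
B = [1.0] * STATE
C = [1.0 / STATE] * STATE
h = [0.0] * STATE

t0 = time.perf_counter()
for _ in range(10_000):                     # stream of 10k "packets"
    y = ssm_step(h, 1.0, 0.01, A, B, C)
elapsed = time.perf_counter() - t0
print(f"{10_000 / elapsed / 1e6:.3f} M steps/s on this core")
```

Running the same loop (or a NEON-vectorized C equivalent) on BF3's Cortex-A78 cores vs. the host would give a first-order answer to the ARM-throughput concern before committing to either path.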
For a latency-sensitive SSM inference workload specifically, has anyone evaluated whether BF3’s ARM cores are viable for on-DPU inference, or does host GPU offload via GPUDirect remain the only practical path at line rate?
Happy to share more on the architecture.
Thanks,
Vickson | Founder, Vixero Technology Enterprise
vixdev.cloud