Hello everyone,
I am currently developing on the BlueField-2/3 DPU platform and am trying to implement a specific capability: having the DPU's Arm cores directly initiate read/write access to the memory of a GPU installed in the same host, with the data path bypassing the host CPU entirely.
My goal is to use the DPU Arm as a co-processor that can directly operate on data in GPU memory, aiming to achieve the lowest possible communication latency.
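Concretely, the flow I have in mind looks roughly like the pseudocode below. To be clear, every name here (`export_gpu_memory_for_dma`, `open_dpu_dma_engine`, etc.) is a placeholder of my own, not a real DOCA or CUDA identifier; I'm only trying to illustrate the data path I'm asking about:

```
// On the host: allocate GPU memory and make it reachable over PCIe
gpu_buf    = cudaMalloc(size)
gpu_handle = export_gpu_memory_for_dma(gpu_buf)     // e.g. via GPUDirect / dma-buf?
send_handle_to_dpu(gpu_handle)                      // some out-of-band channel

// On the DPU Arm cores: map that memory and drive the DMA engine directly
remote_mem = import_host_gpu_memory(gpu_handle)
dma_ctx    = open_dpu_dma_engine()                  // presumably DOCA DMA?
dma_read (dma_ctx, local_arm_buf, remote_mem, size) // GPU -> Arm, host CPU not involved
dma_write(dma_ctx, remote_mem, local_arm_buf, size) // Arm -> GPU, host CPU not involved
```

What I can't figure out is which real APIs correspond to each of these placeholder steps, or whether this flow is even the intended way to do it.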
I have already reviewed the following documentation:
- The NVIDIA DOCA DMA and GPUNetIO libraries.
- Technical documents related to GPUDirect.
However, I haven’t been able to find a clear sample program or an authoritative guide on how to initiate direct access to local GPU memory from the DPU’s Arm cores.
Additionally, I came across the paper “Conspirator: SmartNIC-Aided Control Plane for Distributed ML Workloads” (https://www.usenix.org/system/files/atc24-xiao.pdf), which mentions a “SNIC DMA to GPU” technical path. This makes me more confident that direct communication between the DPU and a local GPU over the PCIe bus is theoretically possible.
Therefore, I would like to ask the community and official experts:
- On the BlueField-2/3 platform, is it possible for the DPU Arm cores to perform DMA operations directly on the memory of a GPU on the same host?
- If so, is there any official sample code, tutorial, or detailed documentation that could guide this implementation?
- What key DOCA libraries, APIs, or driver configurations are required to achieve this?