Hello, I am having trouble running DPA threads. I want to run memcpy operations on DPA threads using async ops so that host execution is not blocked. What I tried first was to create a DPA thread and a notification completion context, attach the DPA thread to the notification completion context, and then notify that context through an RPC running on the DPA engine. This did not trigger the thread, even after I started the notification completion context and called thread_start and thread_run on the thread. Now I’m trying to use a regular completion context instead, but I can’t find any function that sends a completion to a completion context. Does anyone know how to do this? If there is a better way to trigger DPA threads, could you please share your steps and any possible fixes?
You’re on the right track using DPA threads with async ops for memcpy to avoid blocking the host. A quick clarification on how thread activation and completions work in DOCA DPA:
A DPA thread is usually associated with a DPA completion context. When an operation (RDMA, async memcpy, etc.) is issued with completion requested, the hardware posts a completion on that context, and if that context is attached to a DPA thread, the thread is scheduled. You don’t normally “manually send” a completion into a generic completion context from the host.
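The activation rule can be made concrete with a small host-side C model. This is not DOCA code. `model_thread`, `model_completion` and `model_op_done` are hypothetical stand-ins, and the model leaves out device-side details such as consuming and acknowledging completions. It only shows the rule described above: a thread is scheduled when a requested completion lands on a started completion context that is attached to that thread.

```c
/* Hypothetical model of DPA thread activation. NOT the DOCA API:
 * every type and function here is an illustrative stand-in. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_thread {
    bool running;     /* stands in for thread_start + thread_run having been called */
    int  activations; /* how many times the thread has been scheduled */
};

struct model_completion {
    bool started;                  /* completion context has been started */
    struct model_thread *attached; /* thread attached to this context, or NULL */
    int  pending;                  /* completions posted on this context */
};

/* Models an async op (e.g. a memcpy) finishing. A completion is posted only
 * if the op was issued with completion requested; if the context is attached
 * to a running thread, that thread is scheduled. Returns true when it is. */
static bool model_op_done(struct model_completion *comp, bool completion_requested)
{
    assert(comp != NULL);
    if (!completion_requested || !comp->started)
        return false; /* nothing is posted, so nothing wakes up */
    comp->pending++;
    if (comp->attached != NULL && comp->attached->running) {
        comp->attached->activations++;
        return true;
    }
    return false; /* completion posted, but no running thread is attached */
}
```

With a running thread attached, `model_op_done(&comp, true)` schedules it once. The same op issued without a completion request schedules nothing, and a completion posted on a context with no attached thread is recorded but wakes no one.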
To align with the intended flow, I’d suggest checking the following for your DOCA version:
DOCA DPA documentation https://docs.nvidia.com/doca/sdk/doca+dpa/index.html
Then search that page for the DPA Completion Context and DPA Async Ops sections, and for the “Basic Initiator Target” and “Advanced Initiator Target” samples, which show DPA threads attached to completion contexts and triggered by completions.
DOCA Libraries API reference https://docs.nvidia.com/doca/api/2.9.0/pdf/doca-libraries-api.pdf
Search it for doca_dpa_completion_*, doca_dpa_notification_completion_*, and doca_dpa_async_ops_* to compare the exact semantics and lifecycle of completion contexts, notification completions, and async ops.
In particular, confirm the following:
Your DPA thread and DPA completion context are created on the same DPA context, the completion context is attached to the thread, and both are started before you call thread_run.
Your async memcpy is issued with completion requested, on a DPA async ops object attached to that completion context. The completion generated by the memcpy then wakes the DPA thread, so no “manual send-completion” is needed.
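The two checks above can also be written as a small C validator over a snapshot of your setup. This is a hypothetical helper, not part of DOCA. `struct setup_state`, its fields (including recording the order of each call as a step number) and `check_setup` are illustrative names, and you would fill in the snapshot yourself while building the contexts. It returns the first condition that fails.

```c
/* Hypothetical setup checklist. NOT the DOCA API: names are illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct setup_state {
    int  thread_dpa_id;              /* id of the DPA context the thread was created on */
    int  completion_dpa_id;          /* id of the DPA context the completion context was created on */
    bool completion_attached;        /* completion context attached to the thread */
    int  thread_start_step;          /* order in which each call happened; 0 = never called */
    int  completion_start_step;
    int  thread_run_step;
    bool async_ops_on_completion;    /* async ops object attached to this completion context */
    bool memcpy_requests_completion; /* memcpy issued with completion requested */
};

enum setup_problem {
    SETUP_OK = 0,
    SETUP_DIFFERENT_DPA_CONTEXT,
    SETUP_NOT_ATTACHED,
    SETUP_BAD_START_ORDER, /* a start or the run is missing, or run came before a start */
    SETUP_ASYNC_OPS_NOT_ON_COMPLETION,
    SETUP_NO_COMPLETION_REQUESTED
};

/* Returns the first checklist item that the snapshot violates. */
static enum setup_problem check_setup(const struct setup_state *s)
{
    assert(s != NULL);
    if (s->thread_dpa_id != s->completion_dpa_id)
        return SETUP_DIFFERENT_DPA_CONTEXT;
    if (!s->completion_attached)
        return SETUP_NOT_ATTACHED;
    if (s->thread_start_step == 0 || s->completion_start_step == 0 ||
        s->thread_run_step == 0 ||
        s->thread_start_step > s->thread_run_step ||
        s->completion_start_step > s->thread_run_step)
        return SETUP_BAD_START_ORDER;
    if (!s->async_ops_on_completion)
        return SETUP_ASYNC_OPS_NOT_ON_COMPLETION;
    if (!s->memcpy_requests_completion)
        return SETUP_NO_COMPLETION_REQUESTED;
    return SETUP_OK;
}
```

For example, a setup where the completion context is started only after thread_run returns `SETUP_BAD_START_ORDER`, which points at the ordering rule rather than at a missing "send completion" call.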