I’m new to distributed computing in CUDA (CUDA-MPI versions).
I’m working on a project with multiple processes (each handling one GPU), where one of the processes computes a value for a variable (say x) stored in GPU memory. I want to pass the updated variable to the other processes, which then need to update their local copy of x from the sending process.
My implementation can tolerate some delay in the synchronization without losing accuracy, but performance will suffer.
So my questions are:
- Is there any way I can send the updated value from inside the CUDA kernel? If so, how would that affect performance?
- I tried looking online but found few resources that explain this properly with examples. Could you please direct me to resources (ideally with example code) that might help with this issue?
The closest technology to this is GPUDirect Async, which is still in a technology-preview stage.
A more mature technology would be CUDA-aware MPI, which allows transmission of variables directly from GPU memory on one system to GPU memory on another system. However, the transfer is triggered by host code (e.g. MPI_Sendrecv).
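To illustrate, here is a minimal sketch of the CUDA-aware MPI pattern: the host passes a device pointer directly to an MPI call (MPI_Bcast here, chosen for brevity), with no staging copy through host memory. It assumes the MPI library was built with CUDA support and one GPU per rank; names like d_x are placeholders.

```cuda
// Sketch: broadcast a device-resident variable x from rank 0 to all other
// ranks using CUDA-aware MPI. Assumes the MPI library was built with CUDA
// support, so device pointers can be passed to MPI calls directly.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);            // one GPU per process, as in the question

    double *d_x;
    cudaMalloc(&d_x, sizeof(double));
    if (rank == 0) {
        double h_x = 3.14;          // stand-in for the value computed on rank 0's GPU
        cudaMemcpy(d_x, &h_x, sizeof(double), cudaMemcpyHostToDevice);
    }

    // Host-triggered transfer of a device buffer: the device pointer goes
    // straight into the MPI call, no intermediate host buffer needed.
    MPI_Bcast(d_x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    cudaFree(d_x);
    MPI_Finalize();
    return 0;
}
```

With a non-CUDA-aware MPI you would instead need an explicit cudaMemcpy to a host buffer before the MPI call, and another copy back to the device on the receiving ranks.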
Thank you very much for the information. I have a quick question regarding triggering the MPI transfer from host code. Since we launch GPU kernels from the CPU, is there any way I can call a CPU function from one of the threads on the GPU (I could trigger MPI_Sendrecv from that function)? If possible, I want to avoid relaunching the kernel, since that can be expensive in my case.
You cannot call a CPU function from a thread on the GPU, at least not directly. (You could set up a complicated memory "mailbox" polling scheme.)
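As a rough illustration of such a mailbox scheme (a simplified sketch, not production-grade synchronization): the kernel raises a flag in pinned, mapped host memory, and a polling host thread could then perform the MPI call while the kernel is still resident.

```cuda
// Sketch of a host "mailbox" polled by the CPU. The kernel writes x, then
// raises a flag in mapped pinned host memory; the polling host code could
// then trigger MPI_Sendrecv. Simplified illustration only.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compute(volatile int *mailbox, double *x)
{
    *x = 42.0;                  // stand-in for the real computation
    __threadfence_system();     // make *x visible to the host first
    *mailbox = 1;               // then raise the flag
}

int main()
{
    int *h_mailbox, *d_mailbox;
    cudaHostAlloc((void **)&h_mailbox, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mailbox, h_mailbox, 0);
    *h_mailbox = 0;

    double *d_x;
    cudaMalloc(&d_x, sizeof(double));

    compute<<<1, 1>>>((volatile int *)d_mailbox, d_x);  // runs asynchronously

    while (*((volatile int *)h_mailbox) == 0) {
        // Poll; in a real application the kernel would still be in flight
        // here, and the host could now trigger the MPI transfer on d_x.
    }
    printf("mailbox raised\n");

    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFreeHost(h_mailbox);
    return 0;
}
```

Note the pitfalls: the host-side polling loop burns a CPU core, and correctness depends on the fence ordering the flag after the data, which is why this is usually considered a last resort.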
You can issue a CUDA stream callback, which will call a CPU function at a particular point in CUDA stream execution, but it would run only after a kernel has completed; there is no way to trigger it precisely during a kernel execution.
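A short sketch of that callback pattern, using cudaStreamAddCallback: the callback fires after the kernel enqueued earlier in the same stream finishes. One documented restriction to be aware of: the callback itself must not make CUDA API calls, so a CUDA-aware MPI_Sendrecv (which may call into CUDA internally) is typically deferred to a separate host thread that the callback merely signals.

```cuda
// Sketch: a stream callback that fires after the kernel in the same stream
// completes. The callback must not make CUDA API calls, so here it would
// only signal a host thread that then performs the actual MPI transfer.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compute(double *x) { *x = 42.0; }   // stand-in computation

void CUDART_CB afterKernel(cudaStream_t stream, cudaError_t status,
                           void *userData)
{
    // In a real application: set an atomic flag or post a semaphore here,
    // and let a waiting host thread call MPI_Sendrecv on the device buffer.
    printf("kernel finished; safe to trigger the MPI transfer now\n");
}

int main()
{
    double *d_x;
    cudaMalloc(&d_x, sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    compute<<<1, 1, 0, stream>>>(d_x);
    cudaStreamAddCallback(stream, afterKernel, nullptr, 0);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

On newer CUDA versions, cudaLaunchHostFunc serves the same purpose with a simpler callback signature, under the same no-CUDA-calls restriction.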
I understand. Thank you very much for your suggestions!