Is this possible?
I have a machine with 8 GPUs and a program launched via MPI, one process per GPU. My assumption was that this could be done by sharing a piece of volatile GPU memory among the local peers via the cudaIpc* APIs and then using system-scope atomic operations on it.
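For context, the host-side setup looks roughly like the following sketch (simplified: error checking is omitted, and the variable names and the MPI_Bcast handle exchange are illustrative rather than my exact code):

unsigned int* gpu_ipc_count = nullptr;
cudaIpcMemHandle_t handle;
if (rank == 0) {
    // Rank 0 owns the allocation and zero-initializes the counter
    cudaMalloc(&gpu_ipc_count, sizeof(unsigned int));
    cudaMemset(gpu_ipc_count, 0, sizeof(unsigned int));
    cudaIpcGetMemHandle(&handle, gpu_ipc_count);
}
// Ship the opaque IPC handle to the other local ranks
MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank != 0) {
    // Other ranks map the same allocation into their address space
    cudaIpcOpenMemHandle((void**)&gpu_ipc_count, handle,
                         cudaIpcMemLazyEnablePeerAccess);
}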
However, I found that it does not work. Below is a minimal kernel that demonstrates the problem, launched with a single block and a single thread in each process. I had to cast away the volatile qualifier because the atomic* intrinsics do not accept volatile-qualified pointers, though that is just my guess and I am not 100% certain it is the reason.
__global__ void inter_process_atomic_kernel(volatile unsigned int* gpu_ipc_count,
                                            int num_local_gpus) {
    // Expecting each process to see a distinct value in [0, num_local_gpus - 1]
    unsigned int old = atomicInc_system((unsigned int*)gpu_ipc_count, 1);
    printf("Old value : %u\n", old);
}
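Each process then launches the kernel with one block and one thread, roughly like this:

// One block, one thread per process; gpu_ipc_count is the mapped IPC pointer
inter_process_atomic_kernel<<<1, 1>>>(gpu_ipc_count, num_local_gpus);
cudaDeviceSynchronize();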
Output:
[1,1]<stdout>:Old value : 0
[1,0]<stdout>:Old value : 1
[1,7]<stdout>:Old value : 0
[1,4]<stdout>:Old value : 1
[1,2]<stdout>:Old value : 0
[1,3]<stdout>:Old value : 1
[1,6]<stdout>:Old value : 0
[1,5]<stdout>:Old value : 1
If this is not feasible, is there any way inter-process synchronization could be achieved INSIDE CUDA kernels?