Inter-process synchronization INSIDE CUDA kernel

Is this possible?

I have a machine with 8 GPUs and a program launched via MPI. I was assuming this could be done by sharing some volatile GPU memory among local peers via the cudaIpc* APIs, and then using atomic operations with system scope on that memory to achieve the synchronization.
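
For context, the host-side setup I'm using is roughly the following sketch (local_rank and local_comm are placeholders for the usual per-node rank and communicator; error checking omitted):

// Each rank drives its own GPU.
cudaSetDevice(local_rank);

// Rank 0 allocates the shared counter and exports an IPC handle for it.
unsigned int* gpu_ipc_count;
cudaIpcMemHandle_t handle;
if (local_rank == 0) {
  cudaMalloc(&gpu_ipc_count, sizeof(unsigned int));
  cudaMemset(gpu_ipc_count, 0, sizeof(unsigned int));
  cudaIpcGetMemHandle(&handle, gpu_ipc_count);
}

// The handle itself is plain bytes, so it can be broadcast over MPI.
MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, local_comm);

// The other local ranks map rank 0's allocation into their own address space.
if (local_rank != 0) {
  cudaIpcOpenMemHandle((void**)&gpu_ipc_count, handle,
                       cudaIpcMemLazyEnablePeerAccess);
}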

However, I found that it does not work. Below is a simple kernel to demonstrate this (the most concise version, launched with only 1 block and 1 thread per process). It looks like I had to cast away the volatile qualifier because the atomic* APIs do not accept volatile parameters, although that's just my guess and I am not 100% certain that is the reason.

__global__ void inter_process_atomic_kernel(volatile unsigned int* gpu_ipc_count,
                                            int num_local_gpus) {
  // was expecting each process to see a different value in [0, num_local_gpus - 1]
  // the cast drops volatile: atomic* functions do not take volatile pointers
  unsigned int old = atomicInc_system((unsigned int*)gpu_ipc_count, 1);
  printf("Old value : %u\n", old);
}

Output:

[1,1]<stdout>:Old value : 0
[1,0]<stdout>:Old value : 1
[1,7]<stdout>:Old value : 0
[1,4]<stdout>:Old value : 1
[1,2]<stdout>:Old value : 0
[1,3]<stdout>:Old value : 1
[1,6]<stdout>:Old value : 0
[1,5]<stdout>:Old value : 1

If this is not feasible, is there any way synchronization could be achieved INSIDE CUDA kernels?

you probably want to use (or are thinking about) atomicAdd, not atomicInc

atomicInc increments a location (that is, it adds 1 to it) and applies the rollover value you specify: atomicInc(addr, limit) reads the old value at addr, stores ((old >= limit) ? 0 : (old + 1)), and returns the old value. Since you specified a rollover value of 1, the location is reset to zero every time it would be incremented past 1, so the returned value can only ever be 0 or 1. That is exactly what your output shows.

Try doing atomicAdd(..., 1) instead (atomicAdd_system, in your system-scope case). You might wish to read the docs as well.
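
For example, a minimal corrected sketch of your kernel (assuming the same IPC-mapped gpu_ipc_count pointer, zero-initialized before any process launches the kernel):

__global__ void inter_process_atomic_kernel(unsigned int* gpu_ipc_count,
                                            int num_local_gpus) {
  // atomicAdd_system returns the pre-increment value, so across
  // num_local_gpus processes each one sees a unique value in
  // [0, num_local_gpus - 1]
  unsigned int old = atomicAdd_system(gpu_ipc_count, 1u);
  printf("Old value : %u\n", old);
}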

this doesn’t have anything to do with volatile; volatile has no bearing on how atomics behave.
