My system has two nodes with two GPUs per node:
node 0 → GPU 0, GPU 1
node 1 → GPU 0, GPU 1
I’ve been able to share data between GPUs on each node using cudaDeviceEnablePeerAccess(), but so far I have been unable to share data between the nodes. As a first approach, I am trying to use ncclSend() and ncclRecv(). I’ve tried to boil it down to the simple task of sending a single double from node 0, GPU 1 to node 1, GPU 0. I’ve followed example 3 here to initialize MPI and NCCL. As I understand it, each node will have an MPI rank (0 and 1), and each GPU will have a global NCCL rank (0, 1, 2, 3). So I initialize MPI with:
// process info
int myRank, nRanks = 0;
//initializing MPI
MPICHECK(MPI_Init(&argc, &argv));
MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &myRank));
MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &nRanks));
and NCCL with:
//each process is using two GPUs
int nDev = 2;
double** sendbuff = (double**)malloc(nDev * sizeof(double*));
double** recvbuff = (double**)malloc(nDev * sizeof(double*));
cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev);
//get NCCL unique ID at rank 0 and broadcast it to all others
ncclUniqueId id;
ncclComm_t comms[nDev];
if (myRank == 0) ncclGetUniqueId(&id);
MPICHECK(MPI_Bcast((void *)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
// initializing NCCL, group API is required around ncclCommInitRank as it is
// called across multiple GPUs in each thread/process
// nccl ranks use a global numbering
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; i++) {
    CUDACHECK(cudaSetDevice(i));
    NCCLCHECK(ncclCommInitRank(comms + i, nRanks * nDev, id, myRank * nDev + i));
}
NCCLCHECK(ncclGroupEnd());
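To make the numbering concrete: with nRanks == 2 and nDev == 2, the rank formula myRank*nDev + i should produce the following layout, if my understanding is right:
// expected global NCCL ranks from myRank*nDev + i
// node 0 (myRank 0): GPU 0 -> rank 0, GPU 1 -> rank 1
// node 1 (myRank 1): GPU 0 -> rank 2, GPU 1 -> rank 3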
Again, most of this is straight from the example. I then populate sendbuff on node 0, GPU 1, and allocate memory for recvbuff on node 1, GPU 0 (myRank being the MPI rank, i.e. the node), as shown below:
// value to send
double d1 = 5.25;
double *d2;
// populate buffers
if (myRank == 0) {
    // set device
    CUDACHECK(cudaSetDevice(1));
    // allocate
    CUDACHECK(cudaMalloc(sendbuff + 1, sizeof(double)));
    CUDACHECK(cudaMalloc(recvbuff + 1, sizeof(double)));
    // populate sendbuff
    CUDACHECK(cudaMemcpy(sendbuff + 1, &d1, sizeof(double), cudaMemcpyHostToDevice));
}
else if (myRank == 1) {
    // set device & allocate memory for the receive buffer
    CUDACHECK(cudaSetDevice(0));
    CUDACHECK(cudaMalloc(recvbuff, sizeof(double)));
}
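As a quick sanity check (my addition, not in the example), the value can be read back from the device on node 0, assuming sendbuff[1] holds the device pointer returned by cudaMalloc:
// sanity check: read the value back from device memory on node 0, GPU 1
if (myRank == 0) {
    double check = 0.0;
    CUDACHECK(cudaMemcpy(&check, sendbuff[1], sizeof(double), cudaMemcpyDeviceToHost));
    printf("sendbuff[1] holds %f\n", check); // expect 5.25
}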
At this point, all I want to do is send the double that I’ve put in sendbuff on node 0, GPU 1 over to the other node. To do so I have:
// send value from node 0 to node 1
NCCLCHECK(ncclGroupStart());
if (myRank == 0) {
    cudaSetDevice(1);
    NCCLCHECK(ncclSend(sendbuff + 1, sizeof(double), ncclDouble, 3, comms[1], s[1]));
}
else if (myRank == 1) {
    cudaSetDevice(0);
    NCCLCHECK(ncclRecv(recvbuff, sizeof(double), ncclDouble, 2, comms[0], s[0]));
}
NCCLCHECK(ncclGroupEnd());
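For completeness, example 3 also synchronizes the streams after the group call; a sketch of that step for my case, assuming the streams in s were created with cudaStreamCreate earlier (elided above):
// wait for the send/recv streams to finish before touching the data
if (myRank == 0) {
    CUDACHECK(cudaSetDevice(1));
    CUDACHECK(cudaStreamSynchronize(s[1]));
}
else if (myRank == 1) {
    CUDACHECK(cudaSetDevice(0));
    CUDACHECK(cudaStreamSynchronize(s[0]));
}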
But at the ncclGroupEnd() call I get the error Failed, NCCL error jacobi_cuda.cu:798 'invalid usage', where line 798 is just the NCCLCHECK(ncclGroupEnd()); statement. I have checked and re-checked the documentation, and all of the arguments and types seem correct to me. Is there something more fundamental that I’m missing here? Should I try a different communication mechanism instead, like ncclBroadcast() or some type of ncclGather()?
Ultimately, I need to share 2D data at every iteration (it’s a Poisson problem), so I thought ncclSend() and ncclRecv() would be the most appropriate tools for that.
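For context, the per-iteration exchange I have in mind would look roughly like the sketch below. The names d_grid, rows, N, topRank, botRank, comm, and stream are all hypothetical: a row-major device grid with one halo row at the top and bottom, its dimensions, the neighbor NCCL ranks, and this GPU’s communicator and stream.
// exchange halo rows with both neighbors in a single group call;
// note NCCL counts are in elements (N doubles per row), not bytes
NCCLCHECK(ncclGroupStart());
if (topRank >= 0) {
    NCCLCHECK(ncclSend(d_grid + N, N, ncclDouble, topRank, comm, stream));              // first interior row
    NCCLCHECK(ncclRecv(d_grid, N, ncclDouble, topRank, comm, stream));                  // top halo row
}
if (botRank >= 0) {
    NCCLCHECK(ncclSend(d_grid + (rows - 2) * N, N, ncclDouble, botRank, comm, stream)); // last interior row
    NCCLCHECK(ncclRecv(d_grid + (rows - 1) * N, N, ncclDouble, botRank, comm, stream)); // bottom halo row
}
NCCLCHECK(ncclGroupEnd());
CUDACHECK(cudaStreamSynchronize(stream));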