How to use NCCL to communicate between nodes?

My system has two nodes with two GPUs per node:

node 0 → GPU 0, GPU 1
node 1 → GPU 0, GPU 1

I’ve been able to share data between the GPUs on each node using cudaDeviceEnablePeerAccess(), but so far I have been unable to share data between the nodes. As a first approach, I am trying ncclSend() and ncclRecv(). I’ve boiled the problem down to the simple task of sending a single double from node 0, GPU 1 to node 1, GPU 0. I’ve followed example 3 in the NCCL documentation to initialize MPI and NCCL. As I understand it, each node gets an MPI rank (0 and 1), and each GPU gets an NCCL global rank (0, 1, 2, 3). So I initialize MPI with:

    // process info
    int myRank, nRanks = 0;

    //initializing MPI
    MPICHECK(MPI_Init(&argc, &argv));
    MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &myRank));
    MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &nRanks));

and NCCL with:

    //each process is using two GPUs
    int nDev = 2;
    double** sendbuff = (double**)malloc(nDev * sizeof(double*));
    double** recvbuff = (double**)malloc(nDev * sizeof(double*));
    cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev);

    //get NCCL unique ID at rank 0 and broadcast it to all others
    ncclUniqueId id;
    ncclComm_t comms[nDev];
    if (myRank == 0) ncclGetUniqueId(&id);
    MPICHECK(MPI_Bcast((void *)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));

    // initializing NCCL, group API is required around ncclCommInitRank as it is
    // called across multiple GPUs in each thread/process
    // nccl ranks use a global numbering
    NCCLCHECK(ncclGroupStart());
    for (int i=0; i<nDev; i++) {
        CUDACHECK(cudaSetDevice(i));
        NCCLCHECK(ncclCommInitRank(comms+i, nRanks*nDev, id, myRank*nDev + i));
    }
    NCCLCHECK(ncclGroupEnd());
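
I also create one stream per device, following the example (shown here because the send and receive calls further down run on these streams):

    // create one stream per device; these are the streams later passed
    // to ncclSend()/ncclRecv()
    for (int i = 0; i < nDev; i++) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaStreamCreate(s+i));
    }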

Again, most of this is taken straight from the example. I then populate sendbuff on node 0, GPU 1, and allocate receive memory on node 1, GPU 0:

    // value to send, and a host variable to copy the result back into
    double d1 = 5.25;
    double d2 = 0.0;
    // populate buffers
    if (myRank == 0) {
        // set device
        CUDACHECK(cudaSetDevice(1));

        // allocate
        CUDACHECK(cudaMalloc(sendbuff+1, sizeof(double)));
        CUDACHECK(cudaMalloc(recvbuff+1, sizeof(double)));

        // populate sendbuff
        CUDACHECK(cudaMemcpy(sendbuff[1], &d1, sizeof(double), cudaMemcpyHostToDevice));
    }
    else if (myRank == 1) {
        // set device & allocate memory to receive buffer
        CUDACHECK(cudaSetDevice(0));
        CUDACHECK(cudaMalloc(recvbuff, sizeof(double)));
    }

Here myRank is the MPI rank, i.e., the node. At this point, all I want to do is send the double I’ve placed in the node 0, GPU 1 send buffer over to the other node. To do that I have:

    // send value from node 0 to node 1
    NCCLCHECK(ncclGroupStart());
    if (myRank == 0) {
        cudaSetDevice(1);
        NCCLCHECK(ncclSend(sendbuff+1, sizeof(double), ncclDouble, 3, comms[1], s[1]));
    }
    else if (myRank == 1) {
        cudaSetDevice(0);
        NCCLCHECK(ncclRecv(recvbuff, sizeof(double), ncclDouble, 2, comms[0], s[0]));
    }
    NCCLCHECK(ncclGroupEnd());
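
Once the transfer works, the idea is to synchronize the receiving stream and copy the value back to the host to verify it, along these lines (this check is illustrative, not something I’ve gotten to yet):

    // wait for the transfer to complete, then inspect the value on node 1
    if (myRank == 1) {
        CUDACHECK(cudaSetDevice(0));
        CUDACHECK(cudaStreamSynchronize(s[0]));
        CUDACHECK(cudaMemcpy(&d2, recvbuff[0], sizeof(double), cudaMemcpyDeviceToHost));
        printf("received %f\n", d2); // expecting 5.25
    }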

I never get that far, though. The run fails with the error Failed, NCCL error jacobi_cuda.cu:798 'invalid usage', where line 798 is just the NCCLCHECK(ncclGroupEnd()); statement in the send/receive block above. I have checked and re-checked the documentation, and all of the arguments and their types look correct to me. Is there something more fundamental I’m missing here? Should I try a different communication mechanism instead, such as ncclBroadcast() or a gather of some kind like ncclAllGather()?

Ultimately I need to share 2D data between the ranks at every iteration (it’s a Poisson problem), so I thought ncclSend() and ncclRecv() would be the most appropriate mechanism for that.
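
For context, here is a rough sketch of the per-iteration exchange I’m working toward. This is not working code: the function, its buffer arguments (d_sendTop, d_recvTop, d_sendBottom, d_recvBottom), the row length N, and the 1D row decomposition (globalRank, totalRanks, comm, stream) are all placeholders.

    // per-iteration halo exchange in a 1D row decomposition: each global
    // rank trades one boundary row of N doubles with each neighbouring rank
    void exchangeHalos(double *d_sendTop, double *d_recvTop,
                       double *d_sendBottom, double *d_recvBottom,
                       int N, int globalRank, int totalRanks,
                       ncclComm_t comm, cudaStream_t stream)
    {
        NCCLCHECK(ncclGroupStart());
        if (globalRank > 0) { // exchange with the neighbour above
            NCCLCHECK(ncclSend(d_sendTop, N, ncclDouble, globalRank-1, comm, stream));
            NCCLCHECK(ncclRecv(d_recvTop, N, ncclDouble, globalRank-1, comm, stream));
        }
        if (globalRank < totalRanks-1) { // exchange with the neighbour below
            NCCLCHECK(ncclSend(d_sendBottom, N, ncclDouble, globalRank+1, comm, stream));
            NCCLCHECK(ncclRecv(d_recvBottom, N, ncclDouble, globalRank+1, comm, stream));
        }
        NCCLCHECK(ncclGroupEnd());
        CUDACHECK(cudaStreamSynchronize(stream));
    }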