What counts as an AMO in NVSHMEM?

Hi there,

I have an issue with the use of GDRCopy in NVSHMEM with UCX support. According to the installation guide [0] you need GDRCopy for atomics support on non-NVLink connections. Due to an issue with the GDRCopy installation on the cluster that I want to run my NVSHMEM applications on, I decided to refactor my code to remove any atomic memory operations from it (since strictly speaking I don’t necessarily need them for my use case).

But then it occurred to me that I am not using any atomics in the first place! Because according to the NVSHMEM API Documentation [1] only the API calls starting with nvshmem_atomic are considered to be AMO. In my code I am simply using RMA operations such as nvshmemx_double_block_nbi_put, signaling operations like nvshmemx_signal_op and collective calls like nvshmem_fcollect. Therefore I decided to test it with an example provided by NVIDIA which you can find below:

NVSHMEM Put On Block Example
#include <stdio.h>
#include <assert.h>
#include "nvshmem.h"
#include "nvshmemx.h"

#undef CUDA_CHECK
#define CUDA_CHECK(stmt)                                                          \
do {                                                                          \
cudaError_t result = (stmt);                                              \
if (cudaSuccess != result) {                                              \
fprintf(stderr, "[%s:%d] cuda failed with %s \n", __FILE__, __LINE__, \
cudaGetErrorString(result));                                  \
exit(-1);                                                             \
}                                                                         \
} while (0)

#define THREADS_PER_BLOCK 1024

__global__ void set_and_shift_kernel(float *send_data, float *recv_data, int num_elems, int mype,
                                     int npes) {
    int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
    /* set the corresponding element of send_data */
    if (thread_idx < num_elems) send_data[thread_idx] = mype;

    int peer = (mype + 1) % npes;
    /* Every thread in block 0 calls nvshmemx_float_put_block. Alternatively,
       every thread can call shmem_float_p, but shmem_float_p has a disadvantage
       that when the destination GPU is connected via IB, there will be one rma
       message for every single element which can be detrimental to performance.
       And the disadvantage with shmem_float_put is that when the destination GPU is p2p
       connected, it cannot leverage multiple threads to copy the data to the destination
       GPU. */
    int block_offset = blockIdx.x * blockDim.x;
    nvshmemx_float_put_block(recv_data + block_offset, send_data + block_offset,
                             min(blockDim.x, num_elems - block_offset),
                             peer); /* All threads in a block call the API
                                       with the same arguments */
}

int main(int c, char *v[]) {
    int mype, npes, mype_node;
    float *send_data, *recv_data;
    int num_elems = 8192;
    int num_blocks;

    nvshmem_init();

    mype = nvshmem_my_pe();
    npes = nvshmem_n_pes();
    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);

    // application picks the device each PE will use
    CUDA_CHECK(cudaSetDevice(mype_node));
    send_data = (float *)nvshmem_malloc(sizeof(float) * num_elems);
    recv_data = (float *)nvshmem_malloc(sizeof(float) * num_elems);
    assert(send_data != NULL && recv_data != NULL);

    assert(num_elems % THREADS_PER_BLOCK == 0); /* for simplicity */
    num_blocks = num_elems / THREADS_PER_BLOCK;

    set_and_shift_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(send_data, recv_data, num_elems, mype,
                                                            npes);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    /* Do data validation */
    float *host = new float[num_elems];
    CUDA_CHECK(cudaMemcpy(host, recv_data, num_elems * sizeof(float), cudaMemcpyDefault));
    int ref = (mype - 1 + npes) % npes;
    bool success = true;
    for (int i = 0; i < num_elems; ++i) {
        if (host[i] != ref) {
            printf("Error at %d of rank %d: %f\n", i, mype, host[i]);
            success = false;
            break;
        }
    }

    if (success) {
        printf("[%d of %d] run complete \n", mype, npes);
    } else {
        printf("[%d of %d] run failure \n", mype, npes);
    }

    nvshmem_free(send_data);
    nvshmem_free(recv_data);

    nvshmem_finalize();

    return 0;
}

As you can see, this code only uses the nvshmemx_float_put_block API call, but I get the same error message:

src/comm/transports/ucx/ucx.cpp:nvshmemt_ucx_remote_amo:853: AMO 8 not supported with the current configuration (GDRCopy Disabled)
src/comm/proxy/proxy.cu:process_channel_amo:471: aborting due to error in process_channel_dma

Does NVSHMEM convert that RMA operation into an atomic operation? And if so, how can I disable it? Or is it simply not possible to use NVSHMEM without GDRCopy on non-NVLink connections?

I’d appreciate any help!

Thanks,
Pascal

[0] Installation Guide :: NVSHMEM Documentation
[1] NVIDIA OpenSHMEM Library (NVSHMEM) Documentation — NVSHMEM 2.6.0 documentation

1 Like

The statement in the installation guide seems to be accurate for the default IBRC transport.

I checked the UCX transport implementation and it seems that we would need the GDRCopy because the NVSHMEM_SIGNAL_SET is used in barrier implementation and this signal op requires GDRCopy when using UCX transport.