Hi there,
I have an issue using GDRCopy in NVSHMEM with UCX support. According to the installation guide [0], GDRCopy is required for atomics support on non-NVLink connections. Because of a problem with the GDRCopy installation on the cluster I want to run my NVSHMEM applications on, I decided to refactor my code to remove all atomic memory operations (strictly speaking, I don't need them for my use case).
But then it occurred to me that I am not using any atomics in the first place! According to the NVSHMEM API documentation [1], only the API calls starting with nvshmem_atomic are considered AMOs. In my code I only use RMA operations such as nvshmemx_double_put_nbi_block, signaling operations like nvshmemx_signal_op, and collective calls like nvshmem_fcollect. To verify this, I tested with an example provided by NVIDIA, which you can find below:
NVSHMEM Put On Block Example
#include <stdio.h>
#include <assert.h>

#include "nvshmem.h"
#include "nvshmemx.h"

#undef CUDA_CHECK
#define CUDA_CHECK(stmt)                                                          \
    do {                                                                          \
        cudaError_t result = (stmt);                                              \
        if (cudaSuccess != result) {                                              \
            fprintf(stderr, "[%s:%d] cuda failed with %s \n", __FILE__, __LINE__, \
                    cudaGetErrorString(result));                                  \
            exit(-1);                                                             \
        }                                                                         \
    } while (0)

#define THREADS_PER_BLOCK 1024

__global__ void set_and_shift_kernel(float *send_data, float *recv_data, int num_elems, int mype,
                                     int npes) {
    int thread_idx = blockIdx.x * blockDim.x + threadIdx.x;

    /* set the corresponding element of send_data */
    if (thread_idx < num_elems) send_data[thread_idx] = mype;

    int peer = (mype + 1) % npes;

    /* Every thread in the block calls nvshmemx_float_put_block. Alternatively,
       every thread could call nvshmem_float_p, but nvshmem_float_p has the
       disadvantage that when the destination GPU is connected via IB, there is
       one RMA message for every single element, which can be detrimental to
       performance. The disadvantage of nvshmem_float_put is that when the
       destination GPU is P2P-connected, it cannot leverage multiple threads to
       copy the data to the destination GPU. */
    int block_offset = blockIdx.x * blockDim.x;
    nvshmemx_float_put_block(recv_data + block_offset, send_data + block_offset,
                             min(blockDim.x, num_elems - block_offset),
                             peer); /* All threads in a block call the API
                                       with the same arguments */
}

int main(int argc, char *argv[]) {
    int mype, npes, mype_node;
    float *send_data, *recv_data;
    int num_elems = 8192;
    int num_blocks;

    nvshmem_init();
    mype = nvshmem_my_pe();
    npes = nvshmem_n_pes();
    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);

    // application picks the device each PE will use
    CUDA_CHECK(cudaSetDevice(mype_node));

    send_data = (float *)nvshmem_malloc(sizeof(float) * num_elems);
    recv_data = (float *)nvshmem_malloc(sizeof(float) * num_elems);
    assert(send_data != NULL && recv_data != NULL);

    assert(num_elems % THREADS_PER_BLOCK == 0); /* for simplicity */
    num_blocks = num_elems / THREADS_PER_BLOCK;

    set_and_shift_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(send_data, recv_data, num_elems, mype,
                                                            npes);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    /* Do data validation */
    float *host = new float[num_elems];
    CUDA_CHECK(cudaMemcpy(host, recv_data, num_elems * sizeof(float), cudaMemcpyDefault));
    int ref = (mype - 1 + npes) % npes;
    bool success = true;
    for (int i = 0; i < num_elems; ++i) {
        if (host[i] != ref) {
            printf("Error at %d of rank %d: %f\n", i, mype, host[i]);
            success = false;
            break;
        }
    }

    if (success) {
        printf("[%d of %d] run complete \n", mype, npes);
    } else {
        printf("[%d of %d] run failure \n", mype, npes);
    }

    delete[] host;
    nvshmem_free(send_data);
    nvshmem_free(recv_data);
    nvshmem_finalize();
    return 0;
}
As you can see, this code only uses the nvshmemx_float_put_block API call, but I get the same error message:
src/comm/transports/ucx/ucx.cpp:nvshmemt_ucx_remote_amo:853: AMO 8 not supported with the current configuration (GDRCopy Disabled)
src/comm/proxy/proxy.cu:process_channel_amo:471: aborting due to error in process_channel_dma
Does NVSHMEM internally convert that RMA operation into an atomic operation? If so, how can I disable that? Or is it simply not possible to use NVSHMEM without GDRCopy on non-NVLink connections?
I’d appreciate any help!
Thanks,
Pascal
[0] Installation Guide :: NVSHMEM Documentation
[1] NVIDIA OpenSHMEM Library (NVSHMEM) Documentation — NVSHMEM 2.6.0 documentation