I ran the NVSHMEM perftest/host/coll/alltoall_on_stream benchmark on two H100 nodes. I found that when I enable IBGDA and GDRCopy, the performance is really bad.
NVSHMEM build script:
CUDA_HOME=/usr/local/cuda-12.2 \
MPI_HOME=/opt/openmpi/ \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_USE_GDRCOPY=1 \
My execution command:
nvshmrun -np 16 -host node0:8,node1:8 -env NVSHMEM_DEBUG INFO -env NVSHMEM_IB_ENABLE_IBGDA true -env NVSHMEM_DISABLE_GDRCOPY false -env NVSHMEM_DISABLE_NCCL=1 -env NVSHMEM_IB_GID_INDEX 3 /nvshmem_src/bin/perftest/host/coll/alltoall_on_stream -w 20 -n 1000
My result:
#alltoall_on_stream
size(B)  type  latency(us)  min_lat(us)  max_lat(us)  algbw(GB/s)  busbw(GB/s)
     64  int    142.365632      134.016      191.840        0.000        0.000
    128  int     99.843488       88.352     1384.896        0.001        0.001
    256  int     65.940000       58.496       75.328        0.004        0.004
    512  int     53.502976       50.080       60.224        0.010        0.009
   1024  int     51.095424       47.712      223.968        0.020        0.019
   2048  int     40.465600       37.248     1191.296        0.051        0.047
   4096  int     41.163072       38.720       47.520        0.100        0.093
   8192  int     41.944320       39.712      529.696        0.195        0.183
  16384  int     41.164768       39.072       46.912        0.398        0.373
  32768  int     41.365984       39.424       45.440        0.792        0.743
  65536  int     41.904736       40.384       71.232        1.564        1.466
 131072  int     43.321088       39.648      869.376        3.026        2.836
 262144  int     43.899552       41.056      609.440        5.971        5.598
 524288  int     53.964544       52.288       59.232        9.715        9.108
1048576  int     60.450848       58.112       70.176       17.346       16.262
2097152  int     77.826560       75.488      361.440       26.946       25.262
4194304  int    121.894048      118.112      892.320       34.409       32.259
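I think the latency for sizes 64 B to 1024 B is abnormal: it is higher than for larger messages (for example, about 142 us at 64 B versus about 41 us at 4096 B).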
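To make the anomaly concrete, here is a minimal Python sketch that flags the rows slower than a reference row. The latencies are copied from my result; the choice of the 4096 B row as the reference and the 20% threshold are my own assumptions, not anything defined by the perftest.

```python
# Average latencies (us) copied from the result table: size in bytes -> latency.
latency_us = {
    64: 142.365632,
    128: 99.843488,
    256: 65.940000,
    512: 53.502976,
    1024: 51.095424,
    2048: 40.465600,
    4096: 41.163072,
}

BASELINE_SIZE = 4096  # assumption: treat the 4096 B row as the reference
THRESHOLD = 1.2       # assumption: flag rows more than 20% slower than the reference


def slower_than_baseline(latencies, baseline_size, threshold):
    """Return the sizes below baseline_size whose latency exceeds threshold * baseline."""
    base = latencies[baseline_size]
    return [
        size
        for size, lat in sorted(latencies.items())
        if size < baseline_size and lat / base > threshold
    ]


flagged = slower_than_baseline(latency_us, BASELINE_SIZE, THRESHOLD)
for size in flagged:
    ratio = latency_us[size] / latency_us[BASELINE_SIZE]
    print(f"{size:>5} B: {ratio:.2f}x the {BASELINE_SIZE} B latency")
```

With these assumptions, exactly the 64 B to 1024 B rows are flagged, which is the range I am asking about.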
Could you please tell me how to fix this problem?