NVSHMEM on 2 GPU nodes: small-message latency is very high

I ran the NVSHMEM perftest/host/coll/alltoall_on_stream benchmark on 2 H100 nodes (8 PEs per node, 16 in total). I found that with IBGDA and GDRCopy enabled, the performance was really bad.
NVSHMEM build script:

CUDA_HOME=/usr/local/cuda-12.2 \
MPI_HOME=/opt/openmpi/ \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_USE_GDRCOPY=1 \

My execution command:

nvshmrun -np 16 -host node0:8,node1:8 -env NVSHMEM_DEBUG INFO -env NVSHMEM_IB_ENABLE_IBGDA true -env NVSHMEM_DISABLE_GDRCOPY false -env NVSHMEM_DISABLE_NCCL=1  -env NVSHMEM_IB_GID_INDEX 3 /nvshmem_src/bin/perftest/host/coll/alltoall_on_stream -w 20 -n 1000

My result:

#alltoall_on_stream
size(B)     type      latency(us)       min_lat(us)       max_lat(us)       algbw(GB/s)   busbw(GB/s)
64          int       142.365632        134.016           191.840           0.000         0.000
128         int       99.843488         88.352            1384.896          0.001         0.001
256         int       65.940000         58.496            75.328            0.004         0.004
512         int       53.502976         50.080            60.224            0.010         0.009
1024        int       51.095424         47.712            223.968           0.020         0.019
2048        int       40.465600         37.248            1191.296          0.051         0.047
4096        int       41.163072         38.720            47.520            0.100         0.093
8192        int       41.944320         39.712            529.696           0.195         0.183
16384       int       41.164768         39.072            46.912            0.398         0.373
32768       int       41.365984         39.424            45.440            0.792         0.743
65536       int       41.904736         40.384            71.232            1.564         1.466
131072      int       43.321088         39.648            869.376           3.026         2.836
262144      int       43.899552         41.056            609.440           5.971         5.598
524288      int       53.964544         52.288            59.232            9.715         9.108
1048576     int       60.450848         58.112            70.176            17.346        16.262
2097152     int       77.826560         75.488            361.440           26.946        25.262
4194304     int       121.894048        118.112           892.320           34.409        32.259

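I think the latency for sizes 64 B to 1024 B is abnormal: it falls from about 142 us at 64 B to about 51 us at 1024 B, while every size from 2048 B to 262144 B stays at roughly 40 to 44 us. I would expect the smallest messages to be no slower than the larger ones.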
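
To make the comparison concrete, here is a quick Python sketch over the latency column from the table above. The baseline (the lowest average latency in the table) and the 1.2x threshold are my own arbitrary choices for illustration, not anything defined by NVSHMEM. I left out sizes of 524288 B and above, since their latency grows with message size, as expected.

```python
# Average latency (us) per message size (bytes), copied from the table above.
latency_us = {
    64: 142.365632,
    128: 99.843488,
    256: 65.940000,
    512: 53.502976,
    1024: 51.095424,
    2048: 40.465600,
    4096: 41.163072,
    8192: 41.944320,
    16384: 41.164768,
    32768: 41.365984,
    65536: 41.904736,
    131072: 43.321088,
    262144: 43.899552,
}

# Assumption: use the lowest average latency as the baseline.
baseline = min(latency_us.values())

# Assumption: flag anything more than 1.2x the baseline (arbitrary threshold).
THRESHOLD = 1.2

abnormal = [size for size, lat in latency_us.items()
            if lat / baseline > THRESHOLD]

for size in abnormal:
    ratio = latency_us[size] / baseline
    print(f"{size:>7} B: {latency_us[size]:.2f} us ({ratio:.2f}x baseline)")
```

With these choices, the flagged sizes are exactly 64 B through 1024 B.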
Could you please tell me how to fix this problem?