GDS (nvidia-fs) inconsistent performance (bandwidth drops significantly)

While running the gdsio benchmark against the storage layer (VAST over NFSoRDMA), reads/writes start out strong at ~18 GB/s, but partway through the benchmark the speed drops significantly, to as low as ~3 GB/s. Any insights into this?

Here are the test params I am using:

Write Test: gdsio -T 60 -D /gpu-direct-storage/ -w 192 -d 1 -I 1 -x 0 -s 1G -i 1M

IoType: WRITE XferType: GPUD Threads: 192 DataSetSize: 194006016/201326592(KiB) IOSize: 1024(KiB) Throughput: 2.308903 GiB/sec, Avg_Latency: 81215.056568 usecs ops: 189459 total_time 80.132658 secs

Read Test: gdsio -T 60 -D /gpu-direct-storage/ -w 192 -d 1 -I 0 -x 0 -s 1G -i 1M

IoType: READ XferType: GPUD Threads: 192 DataSetSize: 570586112/201326592(KiB) IOSize: 1024(KiB) Throughput: 6.885191 GiB/sec, Avg_Latency: 27231.054742 usecs ops: 557213 total_time 79.032427 secs

GDS-Disabled Read Test: switching to -x 2 (xfer type CPU_GPU), the throughput is far better:

IoType: READ XferType: CPU_GPU Threads: 192 DataSetSize: 1378359296/201326592(KiB) IOSize: 1024(KiB) Throughput: 20.600775 GiB/sec, Avg_Latency: 9095.061970 usecs ops: 1346054 total_time 63.808562 secs
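As a sanity check, the reported throughputs above are consistent with DataSetSize / total_time. A minimal Python sketch recomputing them from the figures copied out of the gdsio output:

```python
# Recompute GiB/s from gdsio's reported DataSetSize (KiB) and total_time (s).
KIB_PER_GIB = 1024 * 1024

runs = {
    # name: (data_set_kib, total_time_s, reported_gib_s)
    "GPUD write":   (194006016,  80.132658,  2.308903),
    "GPUD read":    (570586112,  79.032427,  6.885191),
    "CPU_GPU read": (1378359296, 63.808562, 20.600775),
}

for name, (kib, secs, reported) in runs.items():
    gib_s = kib / KIB_PER_GIB / secs
    print(f"{name}: {gib_s:.3f} GiB/s (reported {reported:.3f})")
```

So the per-run averages check out; the ~18 GB/s burst followed by a drop to ~3 GB/s is averaged away into the 2.3/6.9 GiB/s figures, which is why the average looks so much worse than the initial rate.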

We have verified the following:

  1. IOMMU is disabled
  2. nvidia-peermem.ko and nvidia-fs.ko are loaded correctly, and the -p check reports NFS is supported
  3. nvidia-smi topo -m reports the following:
 nvidia-smi topo -m
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     SYS     SYS     0-31    0
GPU1    NV4      X      SYS     SYS     0-31    0
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X
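One detail worth noting in the matrix above: every GPU↔NIC link is SYS (path crosses the SMP interconnect), while the two NICs are PIX to each other. A small sketch that just parses the pasted matrix to make those link types explicit:

```python
# Parse the nvidia-smi topo -m matrix pasted above and report GPU<->NIC
# link types. SYS = traversal through the SMP/CPU interconnect; PIX =
# same PCIe switch; NV4 = 4-link NVLink.
topo = """\
GPU0 X NV4 SYS SYS
GPU1 NV4 X SYS SYS
NIC0 SYS SYS X PIX
NIC1 SYS SYS PIX X"""

cols = ["GPU0", "GPU1", "NIC0", "NIC1"]
matrix = {}
for line in topo.splitlines():
    row, *links = line.split()
    for col, link in zip(cols, links):
        matrix[(row, col)] = link

for gpu in ("GPU0", "GPU1"):
    for nic in ("NIC0", "NIC1"):
        print(f"{gpu} <-> {nic}: {matrix[(gpu, nic)]}")
# every GPU<->NIC pair in this box is SYS
```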

Any insights on what I should try next to debug this issue? Thank you for your time.