NVIDIA GDS output exceeds NVMe device throughput

Hi Team

We have enabled NVIDIA A100 GPUs on a Dell R760 server with a U.2 NVMe device; the gdscheck output is below.

[root@A100 ~]# /usr/local/cuda-12.3/gds/tools/gdscheck.py -p
 GDS release version: 1.8.1.2
 nvidia_fs version:  2.18 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : false
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : true
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-PCIE-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-PCIE-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12030
 Platform: PowerEdge R760, Arch: x86_64(Linux 5.14.0-362.13.1.el9_3.x86_64)
 Platform verification succeeded

Below is the output from the gdsio tool, where throughput reaches around 172 GiB/s, which is well beyond what a PCIe Gen5 NVMe device can deliver. Is gdsio reporting HBM memory throughput, or where is this number coming from?
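As a rough sanity check (this assumes a Gen5 x4 U.2 link; adjust the lane count if your drive differs), the reported number is roughly an order of magnitude above what the link itself can carry:

import math

# Rough PCIe Gen5 NVMe ceiling vs. the reported gdsio number.
# Assumption: x4 U.2 link, 32 GT/s per lane, 128b/130b encoding,
# ignoring protocol overhead (the real ceiling is a bit lower).
lanes = 4
bytes_per_sec_per_lane = 32e9 / 8 * (128 / 130)   # ~3.94 GB/s per lane
link_gibps = lanes * bytes_per_sec_per_lane / 2**30
print(f'Gen5 x{lanes} ceiling: ~{link_gibps:.1f} GiB/s')            # ~14.7 GiB/s
print(f'Reported 172.28 GiB/s is ~{172.28 / link_gibps:.0f}x the link limit')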

Running: sudo /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 1 -n 1 -T 30 -s 30G -i 1M -w 1 -x 0 -I 2
IoType: RANDREAD XferType: GPUD Threads: 1 DataSetSize: 615096320/31457280(KiB) IOSize: 1024(KiB) Throughput: 19.676493 GiB/sec, Avg_Latency: 49.629803 usecs ops: 600680 total_time 29.812302 secs
latency 49.629803 throughput 19.676493

Running: sudo /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 1 -n 1 -T 30 -s 30G -i 1M -w 1 -x 2 -I 2
IoType: RANDREAD XferType: CPU_GPU Threads: 1 DataSetSize: 322764800/31457280(KiB) IOSize: 1024(KiB) Throughput: 10.600037 GiB/sec, Avg_Latency: 92.127097 usecs ops: 315200 total_time 29.038815 secs
latency 92.127097 throughput 10.600037

Running: sudo /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 1 -n 1 -T 30 -s 30G -i 1M -w 1 -x 4 -I 2
IoType: RANDREAD XferType: CPU_CACHED_GPU Threads: 1 DataSetSize: 129925120/31457280(KiB) IOSize: 1024(KiB) Throughput: 4.236128 GiB/sec, Avg_Latency: 230.524078 usecs ops: 126880 total_time 29.249885 secs
latency 230.524078 throughput 4.236128

Running: sudo /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 1 -n 1 -T 30 -s 30G -i 1M -w 32 -x 0 -I 2
IoType: RANDREAD XferType: GPUD Threads: 32 DataSetSize: 5262499840/1006632960(KiB) IOSize: 1024(KiB) Throughput: 172.277523 GiB/sec, Avg_Latency: 181.402613 usecs ops: 5139160 total_time 29.131548 secs
latency 181.402613 throughput 172.277523

Running: sudo /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 1 -n 1 -T 30 -s 30G -i 1M -w 32 -x 2 -I 2
IoType: RANDREAD XferType: CPU_GPU Threads: 32 DataSetSize: 584962048/1006632960(KiB) IOSize: 1024(KiB) Throughput: 18.204442 GiB/sec, Avg_Latency: 1716.491343 usecs ops: 571252 total_time 30.644350 secs
latency 1716.491343 throughput 18.204442
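For what it's worth, the reported throughput is internally consistent with the other fields gdsio prints (DataSetSize divided by total_time), so this does not look like a simple timer glitch; the question is where those bytes are actually being served from. Note also that for the -w 1 -x 0 run the DataSetSize moved is roughly 20x the 30G file size, which is expected for a timed test that loops over the file, but it means repeated reads of the same blocks could be satisfied from a cache somewhere in the path. A quick cross-check of the arithmetic:

# Cross-check: gdsio's Throughput should equal DataSetSize (KiB moved) / total_time.
runs = [
    (615096320, 29.812302, 19.676493),    # -w 1,  -x 0
    (5262499840, 29.131548, 172.277523),  # -w 32, -x 0
]
for kib_moved, secs, reported in runs:
    computed = kib_moved / secs / 2**20   # KiB/s -> GiB/s
    print(f'computed {computed:.2f} GiB/s vs reported {reported:.2f} GiB/s')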

@kmodukuri , can you please explain the behaviour above when running gdsio with 1M I/O sizes using 1 and 32 threads?

@kmodukuri , can you provide any insight into gdsio? I suspect the tool is reporting incorrect throughput values.

@Robert_Crovella , there seems to be an issue with the I/O throughput values the gdsio tool reports. Can anybody from NVIDIA have a look? We can provide whatever information is needed.

Sorry, I don’t have any information on this topic. I won’t be able to respond to requests here. Hopefully others tagged here can respond at some point.

@karanveersingh5623 ,

Yes, this looks like a bug with the -x 0 test.
Is it reproducible every time?

@karanveersingh5623 please file an NVBug for the above issue.

@kmodukuri , it is reproducible every time. You can try the script below:

import subprocess
import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()  # plt.style.use('seaborn') was removed in recent matplotlib releases

# gdsio -I values (load type)
load_type = {'SEQ_READ': 0, 'SEQ_WRITE': 1, 'RAND_READ': 2, 'RAND_WRITE': 3}
# gdsio -x values (transfer type). Full set:
# {'Storage->GPU (GDS)': 0, 'Storage->CPU': 1, 'Storage->CPU->GPU': 2, 'Storage->CPU->GPU_ASYNC': 3,
#  'Storage->PAGE_CACHE->CPU->GPU': 4, 'Storage->GPU_ASYNC': 5, 'Storage->GPU_BATCH': 6}
transfer_type = {'Storage->GPU (GDS)': 0, 'Storage->CPU->GPU': 2, 'Storage->PAGE_CACHE->CPU->GPU': 4}


def init_gds_files(gdsio_path, output_dir, file_size, device, workers):
    ''' Read tests need existing files, so a write test must be done first with the correct number of workers and file size '''

    # Just do a random write with the correct number of workers, will generate gdsio.[0 - <workers - 1>]
    cmd = ['sudo', gdsio_path, '-D', output_dir, '-d', device, '-T', '1', '-s', file_size, '-w', workers, '-I', 3]
    cmd = [str(x) for x in cmd]
    subprocess.run(cmd)

def main(gdsio_path, output_dir, device, numa_node, load):
    file_size = '30G'
    io_sizes = ['128K', '256K', '512K', '1M', '4M', '16M', '64M', '128M']
    threads = [1, 4, 16, 32]
    time = '30'

    # See if benchmark files need to be generated
    if not os.path.isfile(os.path.join(output_dir, f'gdsio.{max(threads) - 1}')):
        init_gds_files(gdsio_path, output_dir, file_size, device, max(threads))


    res_dict = {'Transfer Type': [], 'Threads': [], 'Throughput (GiB/s)': [], 'Latency (usec)': [], 'IO Size': []}

    base_cmd = ['sudo', gdsio_path, '-D', output_dir, '-d', device, '-n', numa_node, '-T', time, '-s', file_size]
    for io_size in io_sizes:
        for thread in threads:
            for transfer_name, x in transfer_type.items():
                new_cmd = base_cmd + ['-i', io_size] + ['-w', thread] + ['-x', x] + ['-I', load_type[load]]
                new_cmd = [str(x) for x in new_cmd]
                print('Running', new_cmd)
                res = subprocess.run(new_cmd, capture_output=True).stdout
                res = res.decode().split()  # decode bytes, then tokenize on whitespace
                latency = float(res[res.index('Avg_Latency:') + 1])
                throughput = float(res[res.index('Throughput:') + 1])
                print('latency', latency, 'throughput', throughput)

                res_dict['Transfer Type'].append(transfer_name)
                res_dict['Threads'].append(thread)
                res_dict['IO Size'].append(io_size)
                res_dict['Latency (usec)'].append(latency)
                res_dict['Throughput (GiB/s)'].append(throughput)


    df = pd.DataFrame.from_dict(res_dict)
    df.to_csv(f'gds_bench_save_device_{device}_numa_{numa_node}_{load}.csv')

def plot_results(device, numa_node, load):
    df = pd.read_csv(f'gds_bench_save_device_{device}_numa_{numa_node}_{load}.csv')
    
    g = sns.catplot(df, kind='bar', x='Threads', y='Latency (usec)', col='IO Size', hue='Transfer Type', sharey=False)
    g.figure.savefig('gds_plot_latency.png')
    g = sns.catplot(df, kind='bar', x='Threads', y='Throughput (GiB/s)', col='IO Size', hue='Transfer Type', sharey=False)
    g.figure.savefig('gds_plot_throughput.png')

if __name__ == '__main__':
    gdsio_path = '/usr/local/cuda/gds/tools/gdsio'
    gds_dir = 'nvme_mount/gds_benchmarks/'
    device = 1
    numa_node = 1
    load = 'RAND_READ'
    # main(gdsio_path, gds_dir, device, numa_node, load)

    plot_results(device, numa_node, load)
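To reproduce: the main(...) call in the __main__ block is commented out above, so uncomment it on the first run so the CSV exists before plot_results() reads it. The paths (gdsio under /usr/local/cuda/gds/tools/ and the nvme_mount/gds_benchmarks/ directory) will need adjusting for your setup, and the script must be able to run the gdsio invocations under sudo.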

Should I file the bug in this repo?

@kmodukuri
Also, there is another issue: GDS underperformance on a single node (1 NVMe device / md0 with 2 devices).
GDS (-x 0) performs worse than CPU->GPU (-x 2) and page cache (-x 4).

The results below are for md0 using 2x PCIe Gen5 NVMe devices:

   Transfer Type                  Threads  Throughput (GiB/s)  Latency (usec)  IO Size
0  Storage->GPU (GDS)                   1            1.913241      510.290625       1M
1  Storage->CPU->GPU                    1            5.749915      169.789062       1M
2  Storage->PAGE_CACHE->CPU->GPU        1            2.472049      394.949512       1M
3  Storage->GPU (GDS)                  32            8.889768     3515.070397       1M
4  Storage->CPU->GPU                   32            6.064824     5152.629936       1M
5  Storage->PAGE_CACHE->CPU->GPU       32            9.295686     3362.841263       1M
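Before comparing -x 0 against -x 2 / -x 4, it may be worth ruling out page-cache reuse: a timed random-read test loops over the same 30G files, so later passes on the -x 2 / -x 4 paths may be served from RAM rather than the drives. A small helper that could be called between runs in the script above (requires root; writing 3 to drop_caches is the standard Linux knob):

import subprocess

def drop_caches():
    # Flush dirty pages first, then drop the page cache, dentries and
    # inodes so every benchmark run starts cold.
    subprocess.run(['sync'], check=True)
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')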

@kmodukuri , also attaching cufile.log from GDS (-x 0) and the output from CPU->GPU (-x 2):

[root@A100 ~]# /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 0 -s 10G -i 1M -w 32 -x 0 -I 2 -k 1234 -o 1M -V
IoType: RANDREAD XferType: GPUD Threads: 32 DataSetSize: 334745600/335544320(KiB) IOSize: 1024(KiB) Throughput: 12.381408 GiB/sec, Avg_Latency: 2523.604404 usecs ops: 326900 total_time 25.783681 secs
[root@A100 ~]#
[root@A100 ~]#
[root@A100 ~]# /usr/local/cuda-12.3/gds/tools/gdsio -D /mnt/gds -d 0 -s 10G -i 1M -w 32 -x 2 -I 2 -k 1234 -o 1M -V
IoType: RANDREAD XferType: CPU_GPU Threads: 32 DataSetSize: 335350784/335544320(KiB) IOSize: 1024(KiB) Throughput: 13.172497 GiB/sec, Avg_Latency: 2371.958553 usecs ops: 327491 total_time 24.279029 secs
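Since cufile.log is attached below, one quick thing to check is whether cuFile silently fell back to a compatibility/POSIX path for the -x 0 run, which would make the "GDS" numbers really page-cache numbers. A rough scan sketch; the exact message strings vary by cuFile release, so this pattern list is only a guess:

# Hypothetical scan of the attached log for fallback indicators.
patterns = ('compat', 'posix', 'error', 'warn')
with open('cufile-X0.log', errors='replace') as f:
    for line in f:
        if any(p in line.lower() for p in patterns):
            print(line.rstrip())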

cufile-X0.log (27.3 MB)