Dear GDS Team,
I’ve been exploring the GDS Async API and ran into an issue with the number of read and write I/O requests reported in the cufile.log file.
Specifically, after running the first Async API tutorial code (MagnumIO/gds/samples/cufile_sample_031.cc in the NVIDIA/MagnumIO GitHub repository), I observed the following statistics:
GLOBAL STATS:
Read: ok = 2 err = 0
Write: ok = 2 err = 0
HandleRegister: ok = 2 err = 0
HandleDeregister: ok = 2 err = 0
BufRegister: ok = 1 err = 0
BufDeregister: ok = 1 err = 0
BatchSubmit: ok = 0 err = 0
BatchComplete: ok = 0 err = 0
BatchSetup: ok = 0 err = 0
BatchCancel: ok = 0 err = 0
BatchDestroy: ok = 0 err = 0
BatchEnqueued: ok = 0 err = 0
PosixBatchEnqueued: ok = 0 err = 0
BatchProcessed: ok = 0 err = 0
PosixBatchProcessed: ok = 0 err = 0
I’m puzzled by the first two lines, which report 2 reads and 2 writes, as the sample submits only one cuFileReadAsync and one cuFileWriteAsync operation. Could you please clarify why GDS appears to double the read and write counts in this scenario?
I hope this doubled count in cufile.log is reproducible on your side. I ran the code with only one GPU and one SSD. If it is not reproducible, I will post more details about my configuration.
Thank you for your assistance.
Versions:
$ gdscheck -v
GDS release version: 1.7.2.10
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64
NVIDIA-SMI 535.129.03
Driver Version: 535.129.03
CUDA Version: 12.2
The whole log:
27-02-2024 23:14:47:639 [pid=1516982 tid=1516982] INFO 0:324 Lib being used for urcup concurrency : libcufile_ck
27-02-2024 23:14:47:639 [pid=1516982 tid=1516982] INFO cufio_core:556 Loaded successfully libcufile_ck.so
27-02-2024 23:14:47:640 [pid=1516982 tid=1516982] INFO cufio_core:556 Loaded successfully libmount.so
27-02-2024 23:14:47:640 [pid=1516982 tid=1516982] INFO cufio_core:556 Loaded successfully libudev.so
27-02-2024 23:14:47:640 [pid=1516982 tid=1516982] INFO cufio_core:560 Using CKIT static library
27-02-2024 23:14:47:640 [pid=1516982 tid=1516982] INFO 0:163 nvidia_fs driver open invoked
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:401 GDS release version: 1.7.2.10
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:404 nvidia_fs version: 2.17 libcufile version: 2.12
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:408 Platform: x86_64
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:290 NVMe: driver support OK
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:329 WekaFS: driver support OK
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:528 nvidia_fs driver version check ok
27-02-2024 23:14:47:642 [pid=1516982 tid=1516982] INFO cufio-drv:290 NVMe: driver support OK
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:329 WekaFS: driver support OK
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:189 ============
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:190 ENVIRONMENT:
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:191 ============
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:204 =====================
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:205 DRIVER CONFIGURATION:
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:206 =====================
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:208 NVMe : Supported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:209 NVMeOF : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:210 SCSI : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:211 ScaleFlux CSD : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:212 NVMesh : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:215 DDN EXAScaler : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:219 IBM Spectrum Scale : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:223 NFS : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-drv:226 BeeGFS : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1126 WekaFS : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1128 Userspace RDMA : Unsupported
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1136 --Mellanox PeerDirect : Enabled
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1144 --rdma library : Not Loaded (libcufile_rdma.so)
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1147 --rdma devices : Not configured
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-rdma:1150 --rdma_device_status : Up: 0 Down: 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio_core:938 =====================
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio_core:939 CUFILE CONFIGURATION:
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio_core:940 =====================
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1321 properties.use_compat_mode : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1323 properties.force_compat_mode : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1325 properties.gds_rdma_write_support : true
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1327 properties.use_poll_mode : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1329 properties.poll_mode_max_size_kb : 4
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1331 properties.max_batch_io_size : 128
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1333 properties.max_batch_io_timeout_msecs : 5
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1335 properties.max_direct_io_size_kb : 16384
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1337 properties.max_device_cache_size_kb : 1048576
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1339 properties.max_device_pinned_mem_size_kb : 33554432
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1341 properties.posix_pool_slab_size_kb : 4 1024 16384
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1343 properties.posix_pool_slab_count : 128 64 32
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1345 properties.rdma_peer_affinity_policy : RoundRobin
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1347 properties.rdma_dynamic_routing : 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1354 fs.generic.posix_unaligned_writes : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1357 fs.lustre.posix_gds_min_kb: 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1371 fs.beegfs.posix_gds_min_kb: 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1386 fs.weka.rdma_write_support: false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1412 fs.gpfs.gds_write_support: false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1425 profile.nvtx : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1427 profile.cufile_stats : 3
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1429 miscellaneous.api_check_aggressive : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1439 execution.max_io_threads : 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1440 execution.max_io_queue_depth : 128
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1441 execution.parallel_io : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1442 execution.min_io_threshold_size_kb : 1024
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1443 execution.max_request_parallelism : 0
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1444 properties.force_odirect_mode : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO 0:1446 properties.prefer_iouring : false
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:801 =========
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:802 GPU INFO:
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:803 =========
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:436 GPU index 0 Tesla V100-SXM2-16GB bar:1 bar size (MiB):16384 supports GDS, IOMMU State: Pass-through or Enabled
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:436 GPU index 1 Tesla V100-SXM2-16GB bar:1 bar size (MiB):16384 supports GDS, IOMMU State: Pass-through or Enabled
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:450 Total GPUS supported on this platform 2
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:814 ==============
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:815 PLATFORM INFO:
27-02-2024 23:14:47:643 [pid=1516982 tid=1516982] INFO cufio-plat:816 ==============
27-02-2024 23:14:47:644 [pid=1516982 tid=1516982] WARN cufio-plat:564 Found ACS enabled for switch 0000:5d:00.0
27-02-2024 23:14:47:644 [pid=1516982 tid=1516982] WARN cufio-plat:564 Found ACS enabled for switch 0000:85:00.0
27-02-2024 23:14:47:644 [pid=1516982 tid=1516982] INFO cufio-plat:734 cannot open scsi_mod path, skip scsi check
27-02-2024 23:14:47:644 [pid=1516982 tid=1516982] INFO cufio-plat:821 use_mq not detected in scsi configuration.cannot support SCSI disks!
27-02-2024 23:14:47:644 [pid=1516982 tid=1516982] INFO cufio-plat:705 IOMMU: Pass-through or enabled
27-02-2024 23:14:47:649 [pid=1516982 tid=1516982] INFO cufio-plat:723 WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
27-02-2024 23:14:47:649 [pid=1516982 tid=1516982] INFO cufio-plat:857 Platform verification succeeded
27-02-2024 23:14:47:657 [pid=1516982 tid=1516982] INFO cufio-px-pool:453 POSIX pool buffer initialization complete
27-02-2024 23:14:47:657 [pid=1516982 tid=1516982] INFO curdma-ldbal:510 No RDMA devices configured,skipping RDMA load balancer initialization
27-02-2024 23:14:47:659 [pid=1516982 tid=1516982] INFO cufio_core:1004 CUFile initialization complete
27-02-2024 23:14:47:670 [pid=1516982 tid=1516982] INFO cufio-fs:357 Block dev: /dev/nvme1n1 numa node: 0 pci bridge: 0000:3a:02.0
27-02-2024 23:14:47:671 [pid=1516982 tid=1516982] INFO cufio-fs:357 Block dev: /dev/nvme2n1 numa node: 0 pci bridge: 0000:3a:00.0
27-02-2024 23:14:47:707 [pid=1516982 tid=1516982] INFO cufio_core:118 cuFile STATS VERSION : 8
GLOBAL STATS:
Read: ok = 2 err = 0
Write: ok = 2 err = 0
HandleRegister: ok = 2 err = 0
HandleDeregister: ok = 2 err = 0
BufRegister: ok = 1 err = 0
BufDeregister: ok = 1 err = 0
BatchSubmit: ok = 0 err = 0
BatchComplete: ok = 0 err = 0
BatchSetup: ok = 0 err = 0
BatchCancel: ok = 0 err = 0
BatchDestroy: ok = 0 err = 0
BatchEnqueued: ok = 0 err = 0
PosixBatchEnqueued: ok = 0 err = 0
BatchProcessed: ok = 0 err = 0
PosixBatchProcessed: ok = 0 err = 0
Total Read Size (MiB): 2
Read BandWidth (GiB/s): 0
Avg Read Latency (us): 0
Total Write Size (MiB): 2
Write BandWidth (GiB/s): 0
Avg Write Latency (us): 0
Total Batch Read Size (MiB): 0
Total Batch Write Size (MiB): 0
Batch Read BandWidth (GiB/s): 0
Batch Write BandWidth (GiB/s): 0
Avg Batch Submit Latency (us): 0
Avg Batch Completion Latency (us): 0
READ-WRITE SIZE HISTOGRAM :
0-4(KiB): 0 0
4-8(KiB): 0 0
8-16(KiB): 0 0
16-32(KiB): 0 0
32-64(KiB): 0 0
64-128(KiB): 0 0
128-256(KiB): 0 0
256-512(KiB): 0 0
512-1024(KiB): 0 0
1024-2048(KiB): 2 2
2048-4096(KiB): 0 0
4096-8192(KiB): 0 0
8192-16384(KiB): 0 0
16384-32768(KiB): 0 0
32768-65536(KiB): 0 0
65536-...(KiB): 0 0
PER_GPU STATS:
GPU 0(UUID: fb621244b33a3625ba373712b627b55) Read: bw=0 util(%)=0 n=1 posix=0 unalign=0 dr=0 r_sparse=0 r_inline=0 err=0 MiB=1 Write: bw=0 util(%)=0 n=1 posix=0 unalign=0 dr=0 err=0 MiB=1 BufRegister: n=1 err=0 free=1 MiB=0
GPU 1(UUID: e019366ad84f43dada9287dd2d9f) Read: bw=0 util(%)=0 n=0 posix=0 unalign=0 dr=0 r_sparse=0 r_inline=0 err=0 MiB=0 Write: bw=0 util(%)=0 n=0 posix=0 unalign=0 dr=0 err=0 MiB=0 BufRegister: n=0 err=0 free=0 MiB=0
PER_GPU POOL BUFFER STATS:
GPU : 0 pool_size_MiB : 1 usage : 0/1 used_MiB : 0
PER_GPU POSIX POOL BUFFER STATS:
PER_GPU RDMA STATS:
GPU 0000:62:00.0(UUID: fb621244b33a3625ba373712b627b55) :
GPU 0000:89:00.0(UUID: e019366ad84f43dada9287dd2d9f) :
RDMA MRSTATS:
peer name nr_mrs mr_size(MiB)
PER GPU THREAD POOL STATS:
gpu node: 0 enqueues:0 completes:0 pending suspends:0 pending yields:0 active:0 suspends:0
gpu node: 1 enqueues:0 completes:0 pending suspends:0 pending yields:0 active:0 suspends:0
27-02-2024 23:14:47:707 [pid=1516982 tid=1516982] INFO cufio-px-pool:484 POSIX pool buffer release complete
27-02-2024 23:14:48:720 [pid=1516982 tid=1516982] INFO 0:136 nvidia_fs driver closed
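For anyone trying to reproduce this, the counters can be compared mechanically. In the log above, GLOBAL STATS reports Read/Write ok = 2, while the PER_GPU STATS line for GPU 0 reports n=1 for both. Below is a minimal Python sketch that extracts both sets of counters from the log text. The function name `summarize`, the regular expressions, and the shortened UUIDs in the sample are illustrative assumptions based only on the line layout shown in this post; the stats format may differ in other GDS versions.

```python
import re

# Global lines look like: "Read: ok = 2 err = 0"
GLOBAL_RE = re.compile(r"^(Read|Write): ok = (\d+) err = (\d+)\s*$", re.MULTILINE)

# Per-GPU lines look like:
# "GPU 0(UUID: ...) Read: bw=0 util(%)=0 n=1 ... Write: bw=0 util(%)=0 n=1 ..."
PER_GPU_RE = re.compile(
    r"^GPU (\d+)\(UUID: [^)]*\) Read: .*? n=(\d+) .*? Write: .*? n=(\d+) ",
    re.MULTILINE,
)


def summarize(log_text):
    """Return (global ok counts, summed per-GPU n counts) for Read and Write."""
    global_ok = {m.group(1): int(m.group(2)) for m in GLOBAL_RE.finditer(log_text)}
    per_gpu = {"Read": 0, "Write": 0}
    for m in PER_GPU_RE.finditer(log_text):
        per_gpu["Read"] += int(m.group(2))
        per_gpu["Write"] += int(m.group(3))
    return global_ok, per_gpu


# Sample lines copied from the log in this post (UUIDs shortened for illustration).
SAMPLE = """GLOBAL STATS:
Read: ok = 2 err = 0
Write: ok = 2 err = 0
PER_GPU STATS:
GPU 0(UUID: fb62) Read: bw=0 util(%)=0 n=1 posix=0 unalign=0 dr=0 r_sparse=0 r_inline=0 err=0 MiB=1 Write: bw=0 util(%)=0 n=1 posix=0 unalign=0 dr=0 err=0 MiB=1 BufRegister: n=1 err=0 free=1 MiB=0
GPU 1(UUID: e019) Read: bw=0 util(%)=0 n=0 posix=0 unalign=0 dr=0 r_sparse=0 r_inline=0 err=0 MiB=0 Write: bw=0 util(%)=0 n=0 posix=0 unalign=0 dr=0 err=0 MiB=0 BufRegister: n=0 err=0 free=0 MiB=0
"""

if __name__ == "__main__":
    global_ok, per_gpu = summarize(SAMPLE)
    print("global ok counts:", global_ok)
    print("per-GPU n totals:", per_gpu)
```

To run it against a real log, replace `SAMPLE` with the contents of the file, for example `summarize(open("cufile.log").read())`, where the path is whatever location your cuFile configuration writes the log to. The sketch only surfaces the mismatch between the two counters; it does not explain why the global count is doubled.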