GDS seems to consume more CPU and memory resources than expected

Hi there,

I need some help in understanding the performance of GDS.

I am using the gdsio utility to explore the performance of GDS. My experiments show that Storage → GPU transfers consume more CPU than Storage → CPU → GPU transfers, which is the opposite of what I expected, and I am not sure why. Below are the details of my setup and results.

Hardware Setup

  • CPU: INTEL(R) XEON(R) GOLD 6526Y, 64 cores
  • GPU: NVIDIA A100-SXM4-40GB
  • SSD: SAMSUNG MZQL21T9HCJR-00A07, local

Software Setup

  • Ubuntu 22.04, Linux kernel 5.15.0
  • MLNX_OFED: MLNX_OFED_LINUX-24.10-1.1.4.0-ubuntu22.04-x86_64
  • CUDA 12.6
  • GDS release version: 1.11.1.6
  • nvidia_fs version: 2.22
  • libcufile version: 2.12
  • Platform: x86_64
  • NVIDIA driver: 560.35.05

The IOMMU is disabled.

Experiments

I manually created a file at /mnt/nvme2n1/20GFile and used the gdsio utility to measure performance. I monitored GPU and CPU usage using nvtop.

The command was of the form sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "4K" -I 2 -T 20. I varied the -x parameter to select the data transfer mode:

  • -x 0: Storage → GPU
  • -x 2: Storage → CPU → GPU
  • -x 1: Storage → CPU (used to check if the SSD is the bottleneck)

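For context, my understanding is that -x 0 corresponds to the cuFile flow sketched below. This is a minimal sketch with error handling trimmed; the file path and the 1 MiB transfer size are placeholders, and the comments reflect my expectation that the NVMe DMAs directly into GPU memory:

```c
/* Minimal sketch of the direct Storage -> GPU flow that gdsio -x 0
 * exercises via the cuFile API. Error handling is trimmed; the file
 * path and the 1 MiB transfer size are placeholders. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void) {
    const size_t size = 1 << 20;   /* 1 MiB, aligned as O_DIRECT requires */
    int fd = open("/mnt/nvme2n1/20GFile", O_RDONLY | O_DIRECT);

    cuFileDriverOpen();            /* attaches to the nvidia-fs kernel driver */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    /* If GDS is unavailable and compat mode is allowed (the default),
     * this still succeeds and later reads silently take the posix path. */
    cuFileHandleRegister(&handle, &descr);

    void *devPtr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);   /* pin the GPU buffer for DMA */

    /* On a working P2P path the NVMe DMAs straight into GPU memory,
     * with no bounce buffer in host memory. */
    ssize_t n = cuFileRead(handle, devPtr, size,
                           0 /* file offset */, 0 /* devPtr offset */);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return n < 0;
}
```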

The following tables summarize the GPU and CPU behavior under different scenarios.

  • 16 Threads (-w 16)

|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 40 %          | 39 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 299 %         | 258 %               | —             |
| Host Memory | 115 MiB       | 115 MiB             | —             |
| Latency     | 86.53 usec    | 85.86 usec          | 72.62 usec    |
| Throughput  | 0.70 GiB/s    | 0.71 GiB/s          | 0.84 GiB/s    |
  • Increased Threads (-w 32)

|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 58 %          | 57 %                | —             |
| GPU Memory  | 430 MiB       | 424 MiB             | —             |
| CPU Usage   | 654 %         | 550 %               | —             |
| Host Memory | 114 MiB       | 112 MiB             | —             |
| Latency     | 102.76 usec   | 98.92 usec          | 80.77 usec    |
| Throughput  | 1.18 GiB/s    | 1.24 GiB/s          | 1.51 GiB/s    |

As you can see, the Storage → GPU path consumes more CPU, while memory usage stays about the same. If the I/O bypasses the host data copy, we would expect the CPU usage to decrease, right? Maybe there's an issue with my configuration?
I haven't modified any configuration files related to GDS or cuFile; I'm using the default settings.
How can I verify that GDS is functioning correctly and actually bypassing the host data copy?

Can you please share the output of

/usr/local/cuda/gds/tools/gdscheck -p

to make sure the drivers and P2P are set up correctly?
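
If it is easier to check from code, the driver properties that gdscheck -p prints can also be queried programmatically. This is a minimal sketch; the CUfileDrvProps_t field names are assumptions based on cufile.h, so please verify them against your installed header:

```c
/* Sketch: query cuFile driver properties (the same information
 * gdscheck -p reports). Field names per cufile.h; verify against
 * your installed header version. */
#include <stdio.h>
#include <cufile.h>

int main(void) {
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: %d\n", st.err);
        return 1;
    }
    CUfileDrvProps_t props;
    st = cuFileDriverGetProperties(&props);
    if (st.err == CU_FILE_SUCCESS) {
        /* Versions come from the nvidia-fs kernel module; without the
         * module loaded, cuFile can only run in compat (posix) mode. */
        printf("nvidia-fs driver: %u.%u\n",
               props.nvfs.major_version, props.nvfs.minor_version);
    }
    cuFileDriverClose();
    return 0;
}
```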

Also, please share the relevant PCIe topology between the NVMe drive and the GPU in use:

nvidia-smi topo -m
nvidia-smi topo -name

Also, can you try the test at 16K and 64K to see if the pattern is similar?

Thank you for the suggestions!

Below are the results of the gdscheck -p and nvidia-smi topo -m commands. However, the nvidia-smi topo -name option does not work (it reports: Option "-name" is not recognized).

I observed similar results for both 16K and 64K random read tests.

Results of 16K Random Reads

sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "16K" -I 2 -T 20
|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 35 %          | 35 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 240 %         | 208 %               | —             |
| Host Memory | 115 MiB       | 111 MiB             | —             |
| Latency     | 142.62 usec   | 142.11 usec         | 133.85 usec   |
| Throughput  | 1.71 GiB/s    | 1.71 GiB/s          | 1.82 GiB/s    |

Results of 64K Random Reads

sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "64K" -I 2 -T 20
|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 39 %          | 22 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 202 %         | 148 %               | —             |
| Host Memory | 116 MiB       | 111 MiB             | —             |
| Latency     | 385.85 usec   | 381.58 usec         | 37.23 usec    |
| Throughput  | 2.53 GiB/s    | 2.56 GiB/s          | 2.59 GiB/s    |

Sorry for the typo. I am looking for the topology information from

nvidia-smi topo -nvme

Looks like the NVMe P2P path is seeing higher latency.

Also, can you share the PCIe topology by running the following commands?

sudo lspci -tvvvv
sudo lspci -nn

Also, can you please check whether the I/O is, for some reason, going through compat mode? Check for log entries in cufile.log.
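For instance (by default cufile.log is created in the application's working directory, unless a different location is configured in cufile.json):

grep -iE "error|notice|compat" cufile.log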

Thank you for clarifying the command. Here are the results.

There are some error messages in cufile.log:

 10-01-2025 08:58:47:618 [pid=10987 tid=10987] ERROR  cufio-fs:79 mount option not found in mount table data device: /dev/nvme2n1
 10-01-2025 08:58:47:618 [pid=10987 tid=10987] ERROR  cufio-fs:152 EXT4 journal options not found in mount table for device,can't verify data=ordered mode journalling
 10-01-2025 08:58:47:618 [pid=10987 tid=10987] NOTICE  cufio:293 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

You are running in compat mode.

cuFile requires the ext4 data=ordered journaling mode and looks for it in the mount table; since the option is not listed there, the check fails and I/O falls back to the posix compat path. Please unmount the drive, mount it again with the data=ordered option, and share the result.
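For example, assuming the drive is mounted at /mnt/nvme2n1 as your file path suggests (note that ext4 does not allow changing the data= mode on a remount, so a full unmount is needed):

sudo umount /mnt/nvme2n1
sudo mount -o data=ordered /dev/nvme2n1 /mnt/nvme2n1

With the option spelled out explicitly, it appears in the mount table where cuFile looks for it.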

After mounting the drive with the data=ordered option, the CPU utilization dropped from 246% to 115%!

I really appreciate your help!
