GDS seems to consume more CPU and memory resources than expected

Hi there,

I need some help in understanding the performance of GDS.

I am using the gdsio utility to explore the performance of GDS. My experiments show that Storage → GPU transfers consume more CPU than Storage → CPU → GPU transfers, which is the opposite of what I expected, and I am not sure why. Below are the details of my setup and results.

Hardware Setup

  • CPU: INTEL(R) XEON(R) GOLD 6526Y, 64 cores
  • GPU: NVIDIA A100-SXM4-40GB
  • SSD: SAMSUNG MZQL21T9HCJR-00A07, local

Software Setup

  • Ubuntu 22.04, Linux kernel 5.15.0
  • MLNX_OFED: MLNX_OFED_LINUX-24.10-1.1.4.0-ubuntu22.04-x86_64
  • CUDA 12.6
  • GDS release version: 1.11.1.6
  • nvidia_fs version: 2.22
  • libcufile version: 2.12
  • Platform: x86_64
  • NVIDIA driver: 560.35.05

The IOMMU is disabled.

Experiments

I manually created a file at /mnt/nvme2n1/20GFile and used the gdsio utility to measure performance. I monitored GPU and CPU usage using nvtop.

The command was of the form sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "4K" -I 2 -T 20. I varied the -x parameter to select the data transfer mode:

  • -x 0: Storage → GPU
  • -x 2: Storage → CPU → GPU
  • -x 1: Storage → CPU (used to check if the SSD is the bottleneck)

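For context, my understanding is that -x 0 corresponds to the cuFile flow sketched below. This is a minimal sketch with error handling trimmed; the file path and the 1 MiB transfer size are placeholders, and the comments reflect my expectation that the NVMe DMAs directly into GPU memory:

```c
/* Minimal sketch of the direct Storage -> GPU flow that gdsio -x 0
 * exercises via the cuFile API. Error handling is trimmed; the file
 * path and the 1 MiB transfer size are placeholders. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void) {
    const size_t size = 1 << 20;   /* 1 MiB, aligned as O_DIRECT requires */
    int fd = open("/mnt/nvme2n1/20GFile", O_RDONLY | O_DIRECT);

    cuFileDriverOpen();            /* attaches to the nvidia-fs kernel driver */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    /* If GDS is unavailable and compat mode is allowed (the default),
     * this still succeeds and later reads silently take the posix path. */
    cuFileHandleRegister(&handle, &descr);

    void *devPtr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);   /* pin the GPU buffer for DMA */

    /* On a working P2P path the NVMe DMAs straight into GPU memory,
     * with no bounce buffer in host memory. */
    ssize_t n = cuFileRead(handle, devPtr, size,
                           0 /* file offset */, 0 /* devPtr offset */);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return n < 0;
}
```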

The following tables summarize the GPU and CPU behavior under different scenarios.

  • 16 Threads (-w 16)

|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 40 %          | 39 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 299 %         | 258 %               | —             |
| Host Memory | 115 MiB       | 115 MiB             | —             |
| Latency     | 86.53 usec    | 85.86 usec          | 72.62 usec    |
| Throughput  | 0.70 GiB/s    | 0.71 GiB/s          | 0.84 GiB/s    |
  • Increased Threads (-w 32)

|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 58 %          | 57 %                | —             |
| GPU Memory  | 430 MiB       | 424 MiB             | —             |
| CPU Usage   | 654 %         | 550 %               | —             |
| Host Memory | 114 MiB       | 112 MiB             | —             |
| Latency     | 102.76 usec   | 98.92 usec          | 80.77 usec    |
| Throughput  | 1.18 GiB/s    | 1.24 GiB/s          | 1.51 GiB/s    |

As you can see, the Storage → GPU path consumes more CPU, while memory usage stays about the same. If the I/O bypasses the host data copy, we would expect the CPU usage to decrease, right? Maybe there's an issue with my configuration?
I haven't modified any configuration files related to GDS or cuFile; I'm using the default settings.
How can I verify that GDS is functioning correctly and actually bypassing the host data copy?

Can you please share the output of

/usr/local/cuda/gds/tools/gdscheck -p

to make sure the drivers and P2P are set up correctly?
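
If it is easier to check from code, the driver properties that gdscheck -p prints can also be queried programmatically. This is a minimal sketch; the CUfileDrvProps_t field names are assumptions based on cufile.h, so please verify them against your installed header:

```c
/* Sketch: query cuFile driver properties (the same information
 * gdscheck -p reports). Field names per cufile.h; verify against
 * your installed header version. */
#include <stdio.h>
#include <cufile.h>

int main(void) {
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) {
        fprintf(stderr, "cuFileDriverOpen failed: %d\n", st.err);
        return 1;
    }
    CUfileDrvProps_t props;
    st = cuFileDriverGetProperties(&props);
    if (st.err == CU_FILE_SUCCESS) {
        /* Versions come from the nvidia-fs kernel module; without the
         * module loaded, cuFile can only run in compat (posix) mode. */
        printf("nvidia-fs driver: %u.%u\n",
               props.nvfs.major_version, props.nvfs.minor_version);
    }
    cuFileDriverClose();
    return 0;
}
```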

Also, please share the relevant PCIe topology between the NVMe drive and the GPU in use:

nvidia-smi topo -m
nvidia-smi topo -name

Also, can you try the test at 16K and 64K to see if the pattern is similar?

Thank you for the suggestions!

Below are the results of the gdscheck -p and nvidia-smi topo -m commands. However, the nvidia-smi topo -name option does not work (it reports: Option "-name" is not recognized).

I observed similar results for both 16K and 64K random read tests.

Results of 16K Random Reads

sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "16K" -I 2 -T 20
|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 35 %          | 35 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 240 %         | 208 %               | —             |
| Host Memory | 115 MiB       | 111 MiB             | —             |
| Latency     | 142.62 usec   | 142.11 usec         | 133.85 usec   |
| Throughput  | 1.71 GiB/s    | 1.71 GiB/s          | 1.82 GiB/s    |

Results of 64K Random Reads

sudo gdsio -f "/mnt/nvme2n1/20GFile" -d 0 -x 0 -w 16 -s "20G" -i "64K" -I 2 -T 20
|             | Storage → GPU | Storage → CPU → GPU | Storage → CPU |
|-------------|---------------|---------------------|---------------|
| GPU Usage   | 39 %          | 22 %                | —             |
| GPU Memory  | 424 MiB       | 422 MiB             | —             |
| CPU Usage   | 202 %         | 148 %               | —             |
| Host Memory | 116 MiB       | 111 MiB             | —             |
| Latency     | 385.85 usec   | 381.58 usec         | 37.23 usec    |
| Throughput  | 2.53 GiB/s    | 2.56 GiB/s          | 2.59 GiB/s    |

Sorry for the typo. I am looking for the topology information from

nvidia-smi topo -nvme

Looks like the NVMe P2P path is seeing higher latency.

Also, can you share the PCIe topology by running the following commands?

sudo lspci -tvvvv
sudo lspci -nn

Also, can you please check whether the I/O is, for some reason, going through compat mode? Check for log entries in cufile.log.
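For instance (by default cufile.log is created in the application's working directory, unless a different location is configured in cufile.json):

grep -iE "error|notice|compat" cufile.log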

Thank you for clarifying the command. Here are the results.

There are some error messages in cufile.log:

 10-01-2025 08:58:47:618 [pid=10987 tid=10987] ERROR  cufio-fs:79 mount option not found in mount table data device: /dev/nvme2n1
 10-01-2025 08:58:47:618 [pid=10987 tid=10987] ERROR  cufio-fs:152 EXT4 journal options not found in mount table for device,can't verify data=ordered mode journalling
 10-01-2025 08:58:47:618 [pid=10987 tid=10987] NOTICE  cufio:293 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

You are running in compat mode.

cuFile requires the ext4 data=ordered journaling mode and looks for it in the mount table; since the option is not listed there, the check fails and I/O falls back to the posix compat path. Please unmount the drive, mount it again with the data=ordered option, and share the result.
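For example, assuming the drive is mounted at /mnt/nvme2n1 as your file path suggests (note that ext4 does not allow changing the data= mode on a remount, so a full unmount is needed):

sudo umount /mnt/nvme2n1
sudo mount -o data=ordered /dev/nvme2n1 /mnt/nvme2n1

With the option spelled out explicitly, it appears in the mount table where cuFile looks for it.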

After mounting the drive with the data=ordered option, the CPU utilization dropped from 246% to 115%!

I really appreciate your help!
