Why is the CPU usage higher when transferring data via GDS compared to transferring data through the CPU?

I want to test the transfer speed and CPU usage of GDS using gdsio. While transferring data, I monitored CPU usage with top. I noticed that CPU usage was higher with -x 0 (GDS enabled) than with -x 2:

sudo /usr/local/cuda-12.2/gds/tools/gdsio -f /mnt/dd.txt -d 0 -w 4 -s 100G -i 1M -I 0 -x 0
IoType: READ XferType: GPUD Threads: 4 DataSetSize: 104837120/104857600(KiB) IOSize: 1024(KiB) Throughput: 6.622534 GiB/sec, Avg_Latency: 589.436984 usecs ops: 102380 total_time 15.097011 secs

top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2685 root      20   0 5572460 174724  91880 S  82.1   1.1   0:08.72 gdsio

sudo /usr/local/cuda-12.2/gds/tools/gdsio -f /mnt/dd.txt -d 0 -w 4 -s 100G -i 1M -I 0 -x 2
IoType: READ XferType: CPU_GPU Threads: 4 DataSetSize: 104739840/104857600(KiB) IOSize: 1024(KiB) Throughput: 6.573733 GiB/sec, Avg_Latency: 593.864340 usecs ops: 102285 total_time 15.194972 secs


top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2758 root      20   0 5199792 108296  93488 S  72.8   0.7   0:04.88 gdsio

sudo /usr/local/cuda-12.2/gds/tools/gdsio -f /mnt/dd.txt -d 0 -w 4 -s 100G -i 1M -I 0 -x 1
IoType: READ XferType: CPUONLY Threads: 4 DataSetSize: 104857600/104857600(KiB) IOSize: 1024(KiB) Throughput: 6.478798 GiB/sec, Avg_Latency: 597.596406 usecs ops: 102400 total_time 15.434962 secs


top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2783 root      20   0 4835100  13736   8408 S  43.9   0.1   0:03.79 gdsio

So CPU usage is actually lowest in CPU-only mode (-x 1), where the data never goes to the GPU.

Additionally, although throughput is slightly higher with GDS, the gap is very small. Is this expected?
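
To put numbers on that gap, here is a minimal sketch that only redoes the arithmetic on the figures reported above. The throughput values come from the gdsio output and the %CPU values from a single top snapshot per run, so the per-CPU figure is a rough indicator at best. The names RUNS, relative_gain and summarize are made up for this illustration.

```python
# (label, throughput in GiB/s, %CPU from one top snapshot), copied from the runs above
RUNS = [
    ("GPUD    (-x 0)", 6.622534, 82.1),
    ("CPU_GPU (-x 2)", 6.573733, 72.8),
    ("CPUONLY (-x 1)", 6.478798, 43.9),
]


def relative_gain(a, b):
    """Return how much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


def summarize(runs):
    """Compare every run against the first one (the GDS run)."""
    gds_throughput = runs[0][1]
    rows = []
    for label, gibps, cpu_percent in runs:
        gain = relative_gain(gds_throughput, gibps)
        # Throughput per fully busy core, since top reports 100% per core.
        gibps_per_core = gibps / (cpu_percent / 100.0)
        rows.append((label, gain, gibps_per_core))
    return rows


for label, gain, gibps_per_core in summarize(RUNS):
    print(f"{label}: GDS is {gain:.2f}% faster, {gibps_per_core:.2f} GiB/s per busy core")
```

On these inputs the GDS run comes out less than 1% faster than -x 2 and a little over 2% faster than -x 1, while delivering the least throughput per unit of CPU time.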

Also, why are the samples no longer provided in versions after GDS 12.2? I couldn’t find the /gds/samples folder in either CUDA 12.4 or CUDA 12.9.

I later noticed that I was running in compatibility mode, which seems to explain the high CPU usage. I then set allow_compat_mode to false in /etc/cufile.json, but write operations now fail with an error:

sudo /usr/local/cuda-12.2/gds/tools/gdsio -f /mnt/dd.txt -d 0 -w 4 -s 100G -i 1M -I 1 -x 0
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :80530636800, block size  :1048576
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :26843545600, block size  :1048576
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size  :1048576
write io failed of type 1 size: 1048576 , ret: 0 
failed to submit io of type 1 ret: -5 
Error: IO failed stopping traffic, fd :51 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :53687091200, block size  :1048576
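
To make the log above easier to read, here is a small sketch that decodes the errno and converts the offsets to GiB. errno 5 is EIO on Linux. The offsets work out to 75, 25, 0 and 50 GiB, which would be consistent with each of the four workers failing on the first write in its own quarter of the 100G file. That reading assumes gdsio splits the file evenly between workers, which is an inference from the numbers and not something the output states. The names LOG, PATTERN and decode are made up for this illustration.

```python
import errno
import os
import re

# The four "io failed" lines copied from the log above.
LOG = """\
io failed :ret :-5 errno :5, file offset :80530636800, block size  :1048576
io failed :ret :-5 errno :5, file offset :26843545600, block size  :1048576
io failed :ret :-5 errno :5, file offset :0, block size  :1048576
io failed :ret :-5 errno :5, file offset :53687091200, block size  :1048576
"""

PATTERN = re.compile(r"errno :(\d+), file offset :(\d+), block size\s+:(\d+)")
GIB = 1024 ** 3


def decode(log):
    """Return (errno name, offset in GiB, block size in KiB) per failure line."""
    rows = []
    for match in PATTERN.finditer(log):
        err, offset, block = (int(group) for group in match.groups())
        rows.append((errno.errorcode.get(err, "unknown"), offset / GIB, block // 1024))
    return rows


for name, offset_gib, block_kib in decode(LOG):
    # os.strerror gives the platform's message for errno 5.
    print(f"{name} ({os.strerror(5)}): offset {offset_gib:g} GiB, block {block_kib} KiB")
```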

My nvidia-fs is:

dpkg -l | grep nvidia-fs
ii  nvidia-fs                                          2.25.7-1                                amd64        NVIDIA filesystem for GPUDirect Storage
ii  nvidia-fs-dkms                                     2.25.7-1                                amd64        NVIDIA filesystem DKMS package
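
To double-check that the edit to /etc/cufile.json is the one being picked up, a sketch like the following could list every allow_compat_mode value found in the file. It assumes the file is JSON with only // line comments, which plain json.loads rejects, so the comments are stripped first. SAMPLE is a hypothetical excerpt, not the shipped file, and the helper names are made up for this illustration.

```python
import json


def strip_line_comments(text):
    """Remove // comments that sit outside double-quoted strings."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i
                break
        cleaned.append(line[:cut])
    return "\n".join(cleaned)


def find_key(node, key):
    """Yield every value stored under key, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def compat_mode_values(text):
    """Return all allow_compat_mode values found in the config text."""
    return list(find_key(json.loads(strip_line_comments(text)), "allow_compat_mode"))


# Hypothetical excerpt for illustration only.
SAMPLE = """
{
    // comment line
    "properties": {
        "allow_compat_mode": false // edited by hand
    }
}
"""

print(compat_mode_values(SAMPLE))
```

Against the real file, the same function can be called on the contents of /etc/cufile.json, for example compat_mode_values(open("/etc/cufile.json").read()).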