GPUDirect Storage issue on NVIDIA DGX A100 System

Hello,

I am trying to test GPUDirect Storage (GDS) on an NVIDIA DGX A100 40GB system. I have compiled the sample cuFile application shown here:

When I run this app, I get the following error:

Opening File /home/sashok6/test.dat
Opening cuFileDriver.
Registering cuFile handle to /home/sashok6/test.dat.
cuFileHandleRegister fd 3 status 5030

This error code corresponds to CU_FILE_INTERNAL_ERROR.

We are using the CUDA 12.0 driver and toolkit.

Please advise on how I can eliminate this error. Thank you!

Can you share the cufile.log file to check the reason for the error? It should be available in the directory where you ran your application.

Below is the output from cufile.log:

 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:318 [pid=2814460 tid=2814460] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] NOTICE  cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  cufio-fs:322 error creating udev_device for block device dev_no: 0:52
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  cufio-obj:177 unable to get volume attributes for fd 3
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  cufio:1556 cuFileHandleRegister error, failed to allocate file object
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  cufio:1584 cuFileHandleRegister error: internal error
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4
 08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR  0:106 cuDeviceGet failed with error 4

From the logs it looks like we are not able to get the udev attributes of the block device. Can you tell me what kind of block device and which file system is being used for the IO operation?
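
As a rough aid for reading the log above: the dev_no value is a major:minor device number, and on Linux a major of 0 is an anonymous device number that the kernel assigns to file systems with no backing block device node (tmpfs, NFS and ZFS among them). Below is a minimal Python sketch, assuming a Linux host, that looks such a number up in the text of /proc/self/mountinfo. The function names are hypothetical helpers, not part of cuFile.

```python
def find_mount(mountinfo_text, dev_no):
    """Return (mount_point, fs_type) for a 'major:minor' string, or None.

    mountinfo_text is the content of /proc/self/mountinfo. Each line has the
    form: id parent major:minor root mount_point options [optional...] - fstype source superopts
    """
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # not a well-formed mountinfo line
        fields = left.split()
        if len(fields) >= 5 and fields[2] == dev_no:
            return fields[4], right.split()[0]
    return None


def is_anonymous_device(dev_no):
    """True when the major number is 0, i.e. no block device node backs the mount."""
    return dev_no.split(":")[0] == "0"
```

On the affected machine this could be called as find_mount(open("/proc/self/mountinfo").read(), "0:52"), using the dev_no value printed in cufile.log.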

I am not sure how to find out which kind of block device and file system are in use. We are trying to write to the NVMe SSDs available on the NVIDIA DGX A100.

Do you know any commands that I could run to find the information needed?

Can you provide the output of the following commands?

  1. stat (on the directory that holds the test file)
  2. df -Th

Here is the output of those commands:

sashok6@ae-csml-407019@ ~/olb/apps/hackathon2023/cuFileTest (shreyas-rotor) $ stat /home/sashok6
  File: /home/sashok6
  Size: 21              Blocks: 24         IO Block: 1536   directory
Device: 34h/52d Inode: 34          Links: 10
Access: (0700/drwx------)  Uid: (807325/ sashok6)   Gid: ( 2626/gtperson)
Access: 2023-03-06 13:49:13.652791773 -0800
Modify: 2022-04-20 06:03:50.000000000 -0700
Change: 2023-03-06 13:49:18.340759351 -0800
 Birth: -
sashok6@ae-csml-407019@ ~/olb/apps/hackathon2023/cuFileTest (shreyas-rotor) $ df -Th
Filesystem           Type      Size  Used Avail Use% Mounted on
udev                 devtmpfs  252G     0  252G   0% /dev
tmpfs                tmpfs      51G  4.5M   51G   1% /run
/dev/nvme0n1p4       ext4      1.8T  208G  1.5T  13% /
tmpfs                tmpfs     252G  7.9M  252G   1% /dev/shm
tmpfs                tmpfs     5.0M  4.0K  5.0M   1% /run/lock
tmpfs                tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/nvme0n1p2       ext4      2.3G  570M  1.6G  26% /boot
home                 zfs       9.3T  256K  9.3T   1% /home
home/dijuremo_d544ba zfs       9.3T  9.2M  9.3T   1% /home/dijuremo
home/teason30_d544ba zfs       9.3T  5.2M  9.3T   1% /home/teason30
home/sashok6_d544ba  zfs       9.7T  406G  9.3T   5% /home/sashok6
home/psu37_d544ba    zfs       9.5T  178G  9.3T   2% /home/psu37
home/ekurban3_d544ba zfs       9.3T   11G  9.3T   1% /home/ekurban3
/dev/loop2           squashfs   92M   92M     0 100% /snap/lxd/23991
/dev/loop4           squashfs   64M   64M     0 100% /snap/core20/1778
/dev/loop5           squashfs   92M   92M     0 100% /snap/lxd/24061
/dev/loop6           squashfs   50M   50M     0 100% /snap/snapd/17950
tmpfs                tmpfs      51G   20K   51G   1% /run/user/128
/dev/loop8           squashfs   56M   56M     0 100% /snap/core18/2679
tmpfs                tmpfs      51G  4.0K   51G   1% /run/user/807325
/dev/loop0           squashfs   64M   64M     0 100% /snap/core20/1822
/dev/loop1           squashfs   50M   50M     0 100% /snap/snapd/18357
/dev/loop7           squashfs   56M   56M     0 100% /snap/core18/2697

From the df -Th output, /home/sashok6 is on a ZFS file system, which GDS does not support. Currently, ext4 and XFS are the only local file systems that GDS supports. Please refer to this documentation for details.
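
The check described above can be sketched in Python. This is a minimal illustration, assuming the text of `df -Th` is available as a string. The helper names and the GDS_LOCAL_FS set are hypothetical, and the set only reflects the statement in this reply, so check it against the current GDS documentation.

```python
# Local file systems named as GDS-capable in the reply above (assumption:
# taken from this thread, not from the cuFile library itself).
GDS_LOCAL_FS = {"ext4", "xfs"}


def fs_type_for(df_output, path):
    """Return (fs_type, mount_point) for the longest mount point containing path."""
    best = None
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 7:
            continue
        fs_type, mount = cols[1], cols[6]
        inside = path == mount or path.startswith(mount.rstrip("/") + "/")
        if inside and (best is None or len(mount) > len(best[1])):
            best = (fs_type, mount)
    return best


def is_gds_local_fs(fs_type):
    """True when the type is in the supported set assumed above."""
    return fs_type.lower() in GDS_LOCAL_FS


# Three rows copied from the df -Th output posted earlier in the thread.
DF_SAMPLE = """Filesystem           Type      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4       ext4      1.8T  208G  1.5T  13% /
home                 zfs       9.3T  256K  9.3T   1% /home
home/sashok6_d544ba  zfs       9.7T  406G  9.3T   5% /home/sashok6
"""

fs_type, mount = fs_type_for(DF_SAMPLE, "/home/sashok6/test.dat")
print(mount, fs_type, is_gds_local_fs(fs_type))
```

With these rows, the test file resolves to the ZFS dataset mounted at /home/sashok6, while a path directly under / would resolve to the ext4 root partition.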

Ahh, that looks like the problem. Thanks for the help! I’ll look into getting an ext4 partition set up for further testing.

Do you have any insight into when ZFS might be supported, or why it is not? Is the limitation on the ZFS side, the GPUDirect side, or a mix of both? We would be very interested in ZFS support.

I encountered the same error code, but my cufile.log differs from the original poster's. How should I troubleshoot the problem? Also, I am running in a k8s pod. Could that be the reason?

 12-01-2024 19:05:48:25 [pid=55 tid=55] NOTICE  cufio-drv:693 running in compatible mode
 12-01-2024 19:05:52:737 [pid=55 tid=55] ERROR  cufio-udev:68 udev property not found: ID_FS_USAGE dm-0
 12-01-2024 19:05:52:737 [pid=55 tid=55] ERROR  cufio-fs:744 error getting volume attributes error for device: dev_no: 253:0
 12-01-2024 19:05:52:737 [pid=55 tid=55] ERROR  cufio-obj:165 unable to get volume attributes for fd 3
 12-01-2024 19:05:52:737 [pid=55 tid=55] ERROR  cufio:1038 cuFileHandleRegister error, failed to allocate file object
 12-01-2024 19:05:52:737 [pid=55 tid=55] ERROR  cufio:1067 cuFileHandleRegister error: internal error

The df -T output is as follows:

CUDA Toolkit 11.6, Driver Version: 535.129.03, CUDA Version: 12.2