sashok
February 8, 2023, 5:24pm
Hello,
I am trying to test GPUDirect Storage capabilities on an NVIDIA DGX A100 40GB system. I have compiled the sample cuFile application shown here:
The NVIDIA® GPUDirect® Storage cuFile API Reference Guide provides information about the preliminary version of the cuFile API reference guide that is used in applications and frameworks to leverage GDS technology and describes the intent, context,...
When I run this app, I get the following error:
Opening File /home/sashok6/test.dat
Opening cuFileDriver.
Registering cuFile handle to /home/sashok6/test.dat.
cuFileHandleRegister fd 3 status 5030
This error code corresponds to CU_FILE_INTERNAL_ERROR.
We are using the CUDA 12.0 driver and toolkit.
Please advise on how I can eliminate this error. Thank you!
Can you share the cufile.log file to check the reason for the error? It should be available in the directory where you ran your application.
Below is the output from cufile.log:
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:15:48:44 [pid=2809638 tid=2809638] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:17:00:967 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:17:00:968 [pid=2811020 tid=2811020] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:24:790 [pid=2813646 tid=2813646] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:318 [pid=2814460 tid=2814460] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:19:55:319 [pid=2814460 tid=2814460] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:20:18:902 [pid=2815364 tid=2815364] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:23:35:894 [pid=2818803 tid=2818803] NOTICE cufio:1538 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:52
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:52
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR cufio-obj:177 unable to get volume attributes for fd 3
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR cufio:1556 cuFileHandleRegister error, failed to allocate file object
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR cufio:1584 cuFileHandleRegister error: internal error
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
08-02-2023 09:23:35:895 [pid=2818803 tid=2818803] ERROR 0:106 cuDeviceGet failed with error 4
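Since the log repeats the same messages across several runs, it can help to collapse it before reading it. The following is a minimal Python sketch, not part of the cuFile tooling; the line layout it assumes (timestamp, `[pid=… tid=…]`, level, source location, message) is inferred only from the excerpt above, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Line layout inferred from the excerpt above (an assumption):
#   <date> <time> [pid=N tid=N] LEVEL <source:line> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \[pid=(?P<pid>\d+) tid=(?P<tid>\d+)\] "
    r"(?P<level>[A-Z]+)\s+(?P<where>\S+)\s+(?P<msg>.*)$"
)


def summarize(log_text):
    """Count occurrences of each distinct (level, location, message) triple.

    Timestamps and pids are ignored, so repeated runs collapse together.
    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        counts[(m["level"], m["where"], m["msg"])] += 1
    return counts
```

Calling `summarize(open("cufile.log").read())` on the log above should reduce its 96 lines to seven distinct messages, in first-seen order, which makes the `udev_device` and volume-attribute errors easier to spot ahead of the repeated `cuDeviceGet` lines.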
From the logs it looks like we are not able to get the udev attributes of the block device. Can you tell me what kind of block device and which file system is being used for the IO operation?
I am not sure how to find the kind of block device and file system. We are trying to write to the NVMe SSDs available on the NVIDIA DGX A100.
Do you know any commands that I could run to find the information needed?
Can you provide the output of the following commands?
stat
df -Th
Here is the output of those commands:
sashok6@ae-csml-407019@ ~/olb/apps/hackathon2023/cuFileTest (shreyas-rotor) $ stat /home/sashok6
File: /home/sashok6
Size: 21 Blocks: 24 IO Block: 1536 directory
Device: 34h/52d Inode: 34 Links: 10
Access: (0700/drwx------) Uid: (807325/ sashok6) Gid: ( 2626/gtperson)
Access: 2023-03-06 13:49:13.652791773 -0800
Modify: 2022-04-20 06:03:50.000000000 -0700
Change: 2023-03-06 13:49:18.340759351 -0800
Birth: -
sashok6@ae-csml-407019@ ~/olb/apps/hackathon2023/cuFileTest (shreyas-rotor) $ df -Th
Filesystem           Type      Size  Used Avail Use% Mounted on
udev                 devtmpfs  252G     0  252G   0% /dev
tmpfs                tmpfs      51G  4.5M   51G   1% /run
/dev/nvme0n1p4       ext4      1.8T  208G  1.5T  13% /
tmpfs                tmpfs     252G  7.9M  252G   1% /dev/shm
tmpfs                tmpfs     5.0M  4.0K  5.0M   1% /run/lock
tmpfs                tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/nvme0n1p2       ext4      2.3G  570M  1.6G  26% /boot
home                 zfs       9.3T  256K  9.3T   1% /home
home/dijuremo_d544ba zfs       9.3T  9.2M  9.3T   1% /home/dijuremo
home/teason30_d544ba zfs       9.3T  5.2M  9.3T   1% /home/teason30
home/sashok6_d544ba  zfs       9.7T  406G  9.3T   5% /home/sashok6
home/psu37_d544ba    zfs       9.5T  178G  9.3T   2% /home/psu37
home/ekurban3_d544ba zfs       9.3T   11G  9.3T   1% /home/ekurban3
/dev/loop2           squashfs   92M   92M     0 100% /snap/lxd/23991
/dev/loop4           squashfs   64M   64M     0 100% /snap/core20/1778
/dev/loop5           squashfs   92M   92M     0 100% /snap/lxd/24061
/dev/loop6           squashfs   50M   50M     0 100% /snap/snapd/17950
tmpfs                tmpfs      51G   20K   51G   1% /run/user/128
/dev/loop8           squashfs   56M   56M     0 100% /snap/core18/2679
tmpfs                tmpfs      51G  4.0K   51G   1% /run/user/807325
/dev/loop0           squashfs   64M   64M     0 100% /snap/core20/1822
/dev/loop1           squashfs   50M   50M     0 100% /snap/snapd/18357
/dev/loop7           squashfs   56M   56M     0 100% /snap/core18/2697
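The `stat` output above reports `Device: 34h/52d`, which lines up with the `dev_no: 0:52` in cufile.log: major 0, minor 52. On Linux, major number 0 is reserved for unnamed devices, meaning mounts that are not backed by a single block device node. That would be consistent with the udev lookup failing, although this reading of the cuFile error is an assumption. A minimal Python sketch for checking this on any path (the helper names are hypothetical):

```python
import os


def dev_numbers(st_dev):
    """Split a raw st_dev value into its (major, minor) pair."""
    return os.major(st_dev), os.minor(st_dev)


def is_anonymous_device(st_dev):
    """True when the major number is 0.

    On Linux, major 0 is used for unnamed devices: file systems such as
    tmpfs, NFS, or ZFS datasets that have no single /dev block node.
    """
    return os.major(st_dev) == 0


def describe(path):
    """Return a one-line description of the device a path lives on."""
    st = os.stat(path)
    major, minor = dev_numbers(st.st_dev)
    kind = "no backing block device node" if major == 0 else "block device"
    return f"{path}: dev {major}:{minor} ({kind})"
```

For example, `print(describe("/home/sashok6/test.dat"))` would be expected to report `dev 0:52` on this system, while a file on `/` (mounted from `/dev/nvme0n1p4`) should show a nonzero major number.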
sashok:
/home/sashok6/test.dat
From the df -Th output, /home/sashok6/ is on a ZFS file system, which GDS does not support. Currently, ext4 and XFS are the only local file systems that GDS supports. Please refer to this documentation for details:
https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mount-local-fs
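To run the same check before pointing a cuFile test at a path, one option is to look up the path's mount in `/proc/mounts`. This is a minimal, Linux-only Python sketch. The supported set below contains only the two local file systems named in this reply, distributed file systems are not covered, and the helper names are hypothetical. Mount points containing escaped spaces are not handled.

```python
import os

# Local file systems named as GDS-supported in the reply above.
# Distributed file systems are deliberately left out of this sketch.
GDS_LOCAL_FS = {"ext4", "xfs"}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing path.

    mounts_text uses the /proc/mounts layout:
    device, mount point, fs type, options, ...
    Returns None if no mount point matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point override an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def check(path):
    """Return (fs_type, supported) for a path on the running system."""
    with open("/proc/mounts") as f:
        _, fs_type = fs_type_for(os.path.realpath(path), f.read())
    return fs_type, fs_type in GDS_LOCAL_FS
```

On the system in this thread, `check("/home/sashok6/test.dat")` would be expected to return `("zfs", False)`, whereas a file under `/` on the ext4 partition should return `("ext4", True)`.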
Ahh, that looks like the problem. Thanks for the help! I’ll look into getting an ext4 partition set up for further testing.