During testing with cuFileBatchIO, I encountered a cuFile I/O error. The test issues two batchIO requests that together write 256KiB from each of six separate GPU buffers to NVMe-oF backend drives.
- The first batchIO request performs 32KiB writes to a group of six drives and completes successfully.
- The second batchIO request attempts 224KiB writes to a different set of six drives but fails due to “invalid size” errors reported by the nvidia-fs kernel driver.
The entries in the second batchIO request are classified as mixed I/Os because their device pointer offsets (0x8000, i.e. 32KiB) are not aligned to the GPU page size (64KiB). However, it’s unclear why the nvidia-fs driver reports these as invalid sizes.
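For reference, here is a minimal sketch of the alignment arithmetic behind the "mixed I/O" classification. It assumes the 64KiB GPU page size mentioned above. The helper names `is_aligned` and `bytes_to_next_page` are hypothetical and are not part of cuFile or nvidia-fs. The sketch only checks alignment; it does not claim to reproduce how the driver actually splits or validates a request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GPU page size as stated in the post (64 KiB); an assumption of this sketch. */
#define GPU_PAGE_SIZE 0x10000ULL

/* Hypothetical helper: true when value is a multiple of align.
 * align must be a non-zero power of two. */
static bool is_aligned(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}

/* Hypothetical helper: number of bytes from (base + buf_off) up to the next
 * GPU page boundary, or 0 if the address is already page-aligned. */
static uint64_t bytes_to_next_page(uint64_t base, uint64_t buf_off)
{
    uint64_t addr = base + buf_off;
    uint64_t rem = addr & (GPU_PAGE_SIZE - 1);
    return rem == 0 ? 0 : GPU_PAGE_SIZE - rem;
}
```

With the values from the second request (base=0x7eaf1ae80000, buf_off=0x8000, size=0x38000), the starting device address sits 32KiB before the next 64KiB boundary, so the 224KiB transfer covers a 32KiB unaligned head followed by 192KiB, which is three full GPU pages. Neither the offset nor the size is a multiple of 64KiB.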
Test environment:
- GPU: A10-SXM4-40GB
- CUDA version: 13.0
- Driver version: 580.65.06
- GDS release: 1.15.0.42
- nvidia_fs version: 2.26
- libcufile version: 2.12
- MOFED: OFED-internal-25.07-0.9.7
Below are the details of the two batchIO requests (offsets and sizes are in hexadecimal):
devio[0]: 0x7eaabd79f6f0(fh:0xabababab000001d1): opcode=1, base=0x7eaf1ae80000, buf_off=0, file_off=278cd38000, size = 8000
devio[1]: 0x7eaabd79f870(fh:0xabababab000001c8): opcode=1, base=0x7eaf1aec0000, buf_off=0, file_off=278cd38000, size = 8000
devio[2]: 0x7eaabd79f9f0(fh:0xabababab000001cf): opcode=1, base=0x7eaf1af00000, buf_off=0, file_off=278cd38000, size = 8000
devio[3]: 0x7eaabd79fb70(fh:0xabababab000001bb): opcode=1, base=0x7eaf1af40000, buf_off=0, file_off=278cd38000, size = 8000
devio[4]: 0x7eaabd79fcf0(fh:0xabababab000001d2): opcode=1, base=0x7eaf1af80000, buf_off=0, file_off=278cd38000, size = 8000
devio[5]: 0x7eaabd79fe70(fh:0xabababab000001b8): opcode=1, base=0x7eaf1afc0000, buf_off=0, file_off=278cd38000, size = 8000

devio[0]: 0x7eaabd79f720(fh:0xabababab000001cc): opcode=1, base=0x7eaf1ae80000, buf_off=8000, file_off=278cd40000, size = 38000
devio[1]: 0x7eaabd79f8a0(fh:0xabababab000001b7): opcode=1, base=0x7eaf1aec0000, buf_off=8000, file_off=278cd40000, size = 38000
devio[2]: 0x7eaabd79fa20(fh:0xabababab000001c5): opcode=1, base=0x7eaf1af00000, buf_off=8000, file_off=278cd40000, size = 38000
devio[3]: 0x7eaabd79fba0(fh:0xabababab000001a9): opcode=1, base=0x7eaf1af40000, buf_off=8000, file_off=274cd40000, size = 38000
devio[4]: 0x7eaabd79fd20(fh:0xabababab000001ae): opcode=1, base=0x7eaf1af80000, buf_off=8000, file_off=274cd40000, size = 38000
devio[5]: 0x7eaabd79fea0(fh:0xabababab000001ac): opcode=1, base=0x7eaf1afc0000, buf_off=8000, file_off=274cd40000, size = 38000
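
To make the dump above easier to read, the following sketch decodes the hexadecimal fields of one `devio[...]` line. It assumes the line format is exactly as shown in this post. The struct `devio_entry` and the functions `parse_devio` and `to_kib` are hypothetical names introduced for illustration and do not come from cuFile.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record holding the numeric fields of one devio[] line. */
struct devio_entry {
    uint64_t base;      /* GPU buffer base address */
    uint64_t buf_off;   /* offset into the GPU buffer, in bytes */
    uint64_t file_off;  /* offset into the file, in bytes */
    uint64_t size;      /* transfer size, in bytes */
};

/* Parses the "base=..., buf_off=..., file_off=..., size = ..." tail of a
 * line, treating every value as hexadecimal.
 * Returns 0 on success, -1 if the expected fields are not found. */
static int parse_devio(const char *line, struct devio_entry *out)
{
    unsigned long long base, buf_off, file_off, size;
    const char *p = strstr(line, "base=");

    if (p == NULL)
        return -1;
    if (sscanf(p, "base=%llx, buf_off=%llx, file_off=%llx, size = %llx",
               &base, &buf_off, &file_off, &size) != 4)
        return -1;

    out->base = base;
    out->buf_off = buf_off;
    out->file_off = file_off;
    out->size = size;
    return 0;
}

/* Converts a byte count to whole KiB (truncating). */
static uint64_t to_kib(uint64_t bytes)
{
    return bytes / 1024;
}
```

Applied to the entries above, `size = 8000` decodes to 32KiB and `size = 38000` to 224KiB, with `buf_off=8000` in the second request corresponding to a 32KiB offset into each GPU buffer.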
cufile_debug.log (35.7 KB)
nvidia_fs.log (900 Bytes)