cuFileBatchIO fails with mixed IOs

During testing with cuFileBatchIO, I encountered a cuFile I/O error. The test issues two batchIO requests that together write six 256KiB data blocks from six separate GPU buffers to NVMe-oF backend drives. Each request has six entries, one per GPU buffer.

  • The first batchIO request performs 32KiB writes to a group of six drives and completes successfully.

  • The second batchIO request attempts 224KiB writes to a different set of six drives but fails due to “invalid size” errors reported by the nvidia-fs kernel driver.

The entries in the second batchIO request are classified as mixed I/Os because their device pointer offsets (buf_off = 0x8000, i.e. 32KiB) are not aligned to the GPU page size (64KiB). However, it’s unclear why the nvidia-fs driver rejects them with an invalid size error.
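To make the alignment claim concrete, here is a minimal sketch in C that checks whether the effective device address of an entry (base + buf_off) falls on a GPU page boundary. The helper names are hypothetical and not part of the cuFile API, and the 64KiB page size is taken from this post rather than queried from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* GPU page size as stated in this post (64KiB); an assumption, not queried from the driver. */
#define GPU_PAGE_SIZE 0x10000ULL

/* Hypothetical helper: offset of (base + buf_off) within its 64KiB GPU page. */
static uint64_t gpu_page_offset(uint64_t base, uint64_t buf_off)
{
    return (base + buf_off) & (GPU_PAGE_SIZE - 1);
}

/* Hypothetical helper: true when (base + buf_off) starts on a GPU page boundary. */
static bool is_gpu_page_aligned(uint64_t base, uint64_t buf_off)
{
    return gpu_page_offset(base, buf_off) == 0;
}
```

With the values from the dump, base=0x7eaf1ae80000 with buf_off=0 (first request) is page aligned, while the same base with buf_off=0x8000 (second request) lands 0x8000 bytes into a GPU page and is therefore not aligned.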

Test environment:

  • GPU: A100-SXM4-40GB

  • CUDA version: 13.0

  • Driver version: 580.65.06

  • GDS release: 1.15.0.42

  • nvidia_fs version: 2.26

  • libcufile version: 2.12

  • MOFED: OFED-internal-25.07-0.9.7

Below are the details of the two batchIO requests, first request first. Addresses, offsets, and sizes are in hexadecimal; opcode=1 is a write.

devio[0]: 0x7eaabd79f6f0(fh:0xabababab000001d1): opcode=1, base=0x7eaf1ae80000, buf_off=0, file_off=278cd38000, size = 8000
devio[1]: 0x7eaabd79f870(fh:0xabababab000001c8): opcode=1, base=0x7eaf1aec0000, buf_off=0, file_off=278cd38000, size = 8000
devio[2]: 0x7eaabd79f9f0(fh:0xabababab000001cf): opcode=1, base=0x7eaf1af00000, buf_off=0, file_off=278cd38000, size = 8000
devio[3]: 0x7eaabd79fb70(fh:0xabababab000001bb): opcode=1, base=0x7eaf1af40000, buf_off=0, file_off=278cd38000, size = 8000
devio[4]: 0x7eaabd79fcf0(fh:0xabababab000001d2): opcode=1, base=0x7eaf1af80000, buf_off=0, file_off=278cd38000, size = 8000
devio[5]: 0x7eaabd79fe70(fh:0xabababab000001b8): opcode=1, base=0x7eaf1afc0000, buf_off=0, file_off=278cd38000, size = 8000

devio[0]: 0x7eaabd79f720(fh:0xabababab000001cc): opcode=1, base=0x7eaf1ae80000, buf_off=8000, file_off=278cd40000, size = 38000
devio[1]: 0x7eaabd79f8a0(fh:0xabababab000001b7): opcode=1, base=0x7eaf1aec0000, buf_off=8000, file_off=278cd40000, size = 38000
devio[2]: 0x7eaabd79fa20(fh:0xabababab000001c5): opcode=1, base=0x7eaf1af00000, buf_off=8000, file_off=278cd40000, size = 38000
devio[3]: 0x7eaabd79fba0(fh:0xabababab000001a9): opcode=1, base=0x7eaf1af40000, buf_off=8000, file_off=274cd40000, size = 38000
devio[4]: 0x7eaabd79fd20(fh:0xabababab000001ae): opcode=1, base=0x7eaf1af80000, buf_off=8000, file_off=274cd40000, size = 38000
devio[5]: 0x7eaabd79fea0(fh:0xabababab000001ac): opcode=1, base=0x7eaf1afc0000, buf_off=8000, file_off=274cd40000, size = 38000
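
For readers decoding the hexadecimal fields, the following C sketch converts the sizes to KiB and checks that the two entries for one GPU buffer tile it end to end. The struct and function names are hypothetical and only mirror the fields printed in the dump; they are not cuFile types. The 0x40000 buffer length is inferred from the spacing between consecutive base addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of some fields printed in the dump; not a cuFile type. */
struct devio_entry {
    uint64_t base;    /* GPU buffer base address */
    uint64_t buf_off; /* offset into the GPU buffer, in bytes */
    uint64_t size;    /* transfer size, in bytes */
};

/* Convert a byte count to KiB (integer division). */
static uint64_t to_kib(uint64_t bytes)
{
    return bytes / 1024;
}

/* True when `first` and `second` use the same buffer and together cover
 * exactly [0, buf_len) with no gap or overlap. */
static bool covers_buffer(const struct devio_entry *first,
                          const struct devio_entry *second,
                          uint64_t buf_len)
{
    return first->base == second->base
        && first->buf_off == 0
        && first->buf_off + first->size == second->buf_off
        && second->buf_off + second->size == buf_len;
}
```

For devio[0], size 0x8000 is 32KiB and size 0x38000 is 224KiB, and the two entries together cover the full 0x40000 (256KiB) buffer at base 0x7eaf1ae80000.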

cufile_debug.log (35.7 KB)

nvidia_fs.log (900 Bytes)