Following up on my driver issue here, we worked with Dell to ensure that all requirements are met.
We are running Ubuntu 22.04, kernel 5.15.0-89, OFED 23.10-0.5.5.0, CUDA+GDS 12.3. GDS reports version 1.8.1.2 and nvidia_fs version 2.18, with libcufile version 2.12.
NVMe appears as supported and gdsio runs fine for read operations (we are using a modified cufile.json with compatibility mode disabled to ensure GDS is actually being used), but writes fail. Here’s the gdsio command we are using:
CUFILE_ENV_PATH_JSON=$PWD/cufile.debug.json /usr/local/cuda/gds/tools/gdsio -f /mnt/gpu5nvme1/test-fs-1G -d 5 -w 1 -s 1M -x 0 -i 4K:32K:1K -I 1
This is the error that appears on the terminal:
write io failed of type 1 size: 4096 , ret: 0
failed to submit io of type 1 ret: -5
Error: IO failed stopping traffic, fd :199 ret:-5 errno :5
io failed :ret :-5 errno :5, file offset :0, block size :4096
This error also appears in cufile.log:
08-12-2023 19:20:36:178 [pid=112720 tid=112793] ERROR 0:1534 IOCTL failed io-type 1 ret -5 expected 4096 gpu_page_offset 0
And dmesg also shows signs that something is wrong:
[70192.609955] EXT4-fs (nvme1n1p1): mounted filesystem with ordered data mode. Opts: data=ordered. Quota mode: none.
[70192.628634] EXT4-fs (nvme2n1p1): mounted filesystem with ordered data mode. Opts: data=ordered. Quota mode: none.
[70331.701693] EXT4-fs (nvme1n1p1): mounted filesystem with ordered data mode. Opts: data=ordered. Quota mode: none.
[70331.720272] EXT4-fs (nvme2n1p1): mounted filesystem with ordered data mode. Opts: data=ordered. Quota mode: none.
[70699.408177] nvme_log_error: 24 callbacks suppressed
[70699.408181] nvme1n1: I/O Cmd(0x1) @ LBA 33295872, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70699.408928] print_req_error: 24 callbacks suppressed
[70699.408929] blk_update_request: I/O error, dev nvme1n1, sector 33295872 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70699.409844] nvidia-fs:write IO failed :-5
[70699.410182] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70699.410204] nvme1n1: I/O Cmd(0x1) @ LBA 33295360, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70699.410208] blk_update_request: I/O error, dev nvme1n1, sector 33295360 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70699.410215] nvidia-fs:write IO failed :-5
[70699.410217] nvme1n1: I/O Cmd(0x1) @ LBA 33294848, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70699.410221] blk_update_request: I/O error, dev nvme1n1, sector 33294848 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70699.410227] nvidia-fs:write IO failed :-5
[70699.415727] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70699.416577] nvidia-fs:write IO failed :-5
[70716.444523] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70716.444548] nvme1n1: I/O Cmd(0x1) @ LBA 33295872, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70716.444549] nvme1n1: I/O Cmd(0x1) @ LBA 33294848, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70716.444553] blk_update_request: I/O error, dev nvme1n1, sector 33294848 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70716.444562] nvidia-fs:write IO failed :-5
[70716.444561] nvme1n1: I/O Cmd(0x1) @ LBA 33295360, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70716.444565] blk_update_request: I/O error, dev nvme1n1, sector 33295360 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70716.444573] nvidia-fs:write IO failed :-5
[70716.445259] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70716.446094] blk_update_request: I/O error, dev nvme1n1, sector 33295872 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70716.446906] nvidia-fs:write IO failed :-5
[70716.447709] nvidia-fs:write IO failed :-5
[70737.739351] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[70737.739990] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[70737.740587] nvidia-fs:write IO failed :-5
[71251.762593] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[71251.763278] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[71251.763966] nvidia-fs:write IO failed :-5
[71419.209304] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[71419.210025] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[71419.210845] nvidia-fs:write IO failed :-5
[71495.212428] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[71495.213102] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[71495.213689] nvidia-fs:write IO failed :-5
[71538.805915] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[71538.806579] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[71538.807212] nvidia-fs:write IO failed :-5
[72601.734463] nvme1n1: I/O Cmd(0x1) @ LBA 33294336, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[72601.735263] blk_update_request: I/O error, dev nvme1n1, sector 33294336 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[72601.736064] nvidia-fs:write IO failed :-5
[74837.482947] nvme1n1: I/O Cmd(0x1) @ LBA 29100032, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[74837.483610] blk_update_request: I/O error, dev nvme1n1, sector 29100032 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[74837.484241] nvidia-fs:write IO failed :-5
[75578.523266] nvme1n1: I/O Cmd(0x1) @ LBA 2298759168, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[75578.523950] blk_update_request: I/O error, dev nvme1n1, sector 2298759168 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[75578.524617] nvidia-fs:write IO failed :-5
[75883.712121] nvme1n1: I/O Cmd(0x1) @ LBA 2298759168, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
[75883.712828] blk_update_request: I/O error, dev nvme1n1, sector 2298759168 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[75883.713580] nvidia-fs:write IO failed :-5
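To make the dmesg output above easier to read, here is a minimal sketch of a script that groups the NVMe error lines by device and LBA and decodes the status fields. The function name `summarize_nvme_errors` and the regular expression are my own, written against the log format shown above, and the status table is only a small subset of the Generic Command Status values (status code type 0x0) from the NVMe specification, where 0x4 is "Data Transfer Error". Opcode 0x1 is Write in the NVM command set, which matches the `op 0x1:(WRITE)` in the block-layer lines.

```python
import re
from collections import Counter

# Small subset of NVMe Generic Command Status values (status code type 0x0).
GENERIC_STATUS = {
    0x0: "Successful Completion",
    0x1: "Invalid Command Opcode",
    0x2: "Invalid Field in Command",
    0x4: "Data Transfer Error",
    0x6: "Internal Error",
}

# Matches kernel lines such as:
#   nvme1n1: I/O Cmd(0x1) @ LBA 33295872, 8 blocks, I/O Error (sct 0x0 / sc 0x4)
NVME_ERR = re.compile(
    r"(?P<dev>nvme\d+n\d+): I/O Cmd\((?P<opcode>0x[0-9a-fA-F]+)\) @ LBA (?P<lba>\d+), "
    r"(?P<blocks>\d+) blocks, I/O Error "
    r"\(sct (?P<sct>0x[0-9a-fA-F]+) / sc (?P<sc>0x[0-9a-fA-F]+)\)"
)


def summarize_nvme_errors(lines):
    """Count NVMe I/O errors per (device, LBA, decoded status)."""
    counts = Counter()
    for line in lines:
        m = NVME_ERR.search(line)
        if m is None:
            continue  # not an NVMe I/O error line
        sct = int(m["sct"], 16)
        sc = int(m["sc"], 16)
        if sct == 0:
            meaning = GENERIC_STATUS.get(sc, "unknown generic status")
        else:
            meaning = "non-generic status (not decoded here)"
        counts[(m["dev"], int(m["lba"]), meaning)] += 1
    return counts


# Three lines copied from the dmesg output above.
sample = [
    "[70699.408181] nvme1n1: I/O Cmd(0x1) @ LBA 33295872, 8 blocks, I/O Error (sct 0x0 / sc 0x4)",
    "[70699.409844] nvidia-fs:write IO failed :-5",
    "[70716.444548] nvme1n1: I/O Cmd(0x1) @ LBA 33295872, 8 blocks, I/O Error (sct 0x0 / sc 0x4)",
]

for (dev, lba, meaning), n in sorted(summarize_nvme_errors(sample).items()):
    print(f"{dev} LBA {lba}: {n} error(s), {meaning}")
```

The same function can be fed the full log, for example by iterating over the lines of a saved `dmesg` dump.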
I double-checked that the input/output file is not sparse.
Any suggestions on how to fix this issue?
Thanks