NVMe timeout after upgrading from JetPack 5.1 to JetPack 6

Hello,

I have been using the NVIDIA Jetson AGX Orin Developer Kit with JetPack 5.1 and had no issues running my application with an NVMe SSD attached to a PCIe slot via an NVMe PCIe expansion card.

However, after upgrading the board to JetPack 6.0 by flashing the eMMC, I encountered an issue where my TensorRT program hangs during inference. (A hang occurs approximately 1-2 times per 1,000 inputs.)

Each program hang coincides with the following message in the dmesg log:

[256040.385692] nvme nvme0: I/O 413 QID 2 timeout, completion polled

Inference stalls for about 30 seconds, and the next inference can proceed only after the above message is printed in dmesg.
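
(To catch each occurrence as it happens, the kernel log can be watched in real time with human-readable timestamps, e.g.:

sudo dmesg --follow -T | grep -i nvme

Each new "timeout, completion polled" line then coincides with the end of a delayed cudaStreamSynchronize call.)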

Given that my application heavily accesses NVMe storage, I suspect this issue is related to blocked NVMe access, which also stalls TensorRT inference, causing the delays in cudaStreamSynchronize.

Are there any NVMe-related issues after upgrading the Linux kernel from Jetpack 5.1 to Jetpack 6.0?

Then does it happen when you connect the SSD directly to the M.2 slot on the board?
I don’t get why you need a PCIe expansion card when you have direct access to the M.2 slot.

Please put the full dmesg log here.

The screw on the M.2 slot of my AGX Orin board could not be loosened, so I cannot use that slot. That’s why I connected the SSD through a PCIe expansion card.

The file below is the full dmesg log. Each line with “nvme nvme0: I/O XXX QID X timeout, completion polled” is printed at the moment a delayed cudaStreamSynchronize finishes.

dmesg.log (69.9 KB)

This happens because the kernel somehow missed the I/O completion interrupt from the NVMe device (hence “completion polled”).
Can you try reducing disk access, or adjust nvpmodel and run sudo jetson_clocks so the CPU runs faster?
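
For example (mode numbering varies by platform; on AGX Orin, MAXN is typically mode 0):

sudo nvpmodel -q     # query the current power mode
sudo nvpmodel -m 0   # switch to MAXN
sudo jetson_clocks   # lock clocks at their maximum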

I am using MAXN mode and have already applied “sudo jetson_clocks.” The timeout still occurred with that configuration.

I am running a 3D object detection application on the nuScenes dataset, which contains 6,019 samples (around 30 GB in total). To reduce disk access, I added a 100 ms delay between handling each sample. However, the same timeout issue still occurs.

I have observed that the timeout issue does not occur when I run my application after executing the following command:

echo 3 > /proc/sys/vm/drop_caches
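
(Note that the redirection needs a root shell; from a normal user, an equivalent is:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Running sync first writes out dirty pages so that the caches can actually be dropped.)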

Could this observation assist you in diagnosing the issue?

I think it’s an issue in the kernel itself.
Do you hit the same thing on an x86 PC also running kernel 5.15?
Alternatively, you may try upgrading the kernel on your Jetson device, as we provide some flexibility in which kernel versions you can use in JetPack 6:
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/Kernel/BringYourOwnKernel.html

I’m not sure, but since dropping the caches forces the kernel to read the files from disk, maybe the I/O interrupts are then handled properly.
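
One way to check this theory (requires the sysstat package) is to watch whether reads actually hit the disk while the application runs:

iostat -x 1

If the r/s column for nvme0n1 stays near zero, the data is being served from the page cache rather than from the SSD.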

Or can you provide some sample code that consistently reproduces this issue for us to check?

I reproduced the issue with the CUDA-CenterPoint example.

As preparation, you need to create the 6,019 .bin files by running eval_nusc.py --dump.

I was able to reproduce the same situation when I ran the CenterPoint program twice (the first run fills the page cache).
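
Roughly, the sequence is as follows (the binary name is illustrative; use whatever the sample builds):

python eval_nusc.py --dump   # dump the 6,019 .bin input files
./centerpoint                # first run: fills the page cache
./centerpoint                # second run: the timeouts show up here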

OK, we will look into it.
Does it happen on all kinds of NVMe disks?

It would be great if you could also test this.

I only have a single NVMe SSD. I am using an SK hynix P31 2TB.
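
(For reference, the drive model and its health stats can be read with the nvme-cli package:

sudo nvme list                   # model, firmware, capacity
sudo nvme smart-log /dev/nvme0   # temperature, error counts, wear
)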

I don’t have an NVMe SSD in my Linux PC.
Since I don’t have a separate kernel build environment, I’ll attempt to upgrade the kernel when I have time.

After I upgraded the kernel from 5.15 to 6.6, the NVMe timeout and the cudaStreamSynchronize delay no longer occurred.

The kernel upgrade is easily done as long as the defconfig is properly set.
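
In outline, I followed the Bring Your Own Kernel flow from the page linked above (a rough sketch, built natively on the Orin; the Jetson-specific config options come from that document):

git clone --depth 1 --branch v6.6 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
make defconfig          # then apply the Jetson config fragments per the BYOK guide
make -j"$(nproc)" Image modules
sudo make modules_install
sudo cp arch/arm64/boot/Image /boot/Image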

Could you please let me know what has been fixed for this issue?

Thank you for your assistance.

OK, it’s great that you solved it with kernel 6.6.
If it was caused by the upstream kernel, then it’s not something solely controlled by NVIDIA.
You may search the Git repo for the code that throws the error and see if you get some clue.

After running the application many more times, the error still occurs with low probability
(around 1-2 times per 100,000 inputs).

I think the issue is not fully fixed in kernel 6.6.

Do I need to upgrade the kernel to a later version?

According to the “Upstream Patches” page, the latest kernel version shown in the patch list is 6.7. Does this mean 6.7 is the latest kernel I can try?

You can give kernel 6.7 a try.

I upgraded the kernel to 6.7. Currently, no error occurs.
I will monitor the program for a while to see whether the same error happens or not.

After upgrading from 6.6 to 6.7, my application runs 20% faster than before.
Are there any changes related to performance?

I still get the same error in kernel 6.7 as in kernel 6.6.

Does it happen on all kinds of NVMe disks?

I only have one NVMe disk.

Have you tried reproducing this error?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.