NVMe timeout after upgrading from JetPack 5.1 to JetPack 6

Hello,

I have been using the NVIDIA Jetson AGX Orin Developer Kit with JetPack 5.1 and had no issues running my application with an NVMe SSD attached to a PCIe slot via an NVMe PCIe expansion card.

However, after upgrading the board to JetPack 6.0 by flashing the eMMC, I encountered an issue where my TensorRT program hangs during inference. (A hang occurs approximately 1-2 times per 1,000 inputs.)

Each program hang coincides with the following message in the dmesg log:

[256040.385692] nvme nvme0: I/O 413 QID 2 timeout, completion polled

Inference stalls for about 30 seconds, and the next inference can proceed only after the above message is printed in dmesg.
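
(To catch each occurrence as it happens, the kernel log can be watched in real time with human-readable timestamps, e.g.:

sudo dmesg --follow -T | grep -i nvme

Each new "timeout, completion polled" line then coincides with the end of a delayed cudaStreamSynchronize call.)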

Given that my application heavily accesses NVMe storage, I suspect this issue is related to blocked NVMe access, which also stalls TensorRT inference, causing the delays in cudaStreamSynchronize.

Are there any NVMe-related issues after upgrading the Linux kernel from Jetpack 5.1 to Jetpack 6.0?

Then does it happen when you connect the SSD directly to the M.2 slot on the board?
I don’t get why you need a PCIe expansion card when you have direct access to the M.2 slot.

Please put the full dmesg log here.

The screw on the M.2 slot of my AGX Orin board could not be loosened, so I cannot use that slot. That’s why I connected the SSD through a PCIe expansion card.

The file below is the full dmesg log. Each line with “nvme nvme0: I/O XXX QID X timeout, completion polled” is printed at the moment a delayed cudaStreamSynchronize finishes.

dmesg.log (69.9 KB)

This happens because the kernel somehow missed the I/O completion interrupt from the NVMe device (hence “completion polled”).
Can you try reducing disk access, or adjust nvpmodel and run sudo jetson_clocks so the CPU runs faster?
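
For example (mode numbering varies by platform; on AGX Orin, MAXN is typically mode 0):

sudo nvpmodel -q     # query the current power mode
sudo nvpmodel -m 0   # switch to MAXN
sudo jetson_clocks   # lock clocks at their maximum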

I am using MAXN mode and have already applied “sudo jetson_clocks.” The timeout still occurred with that configuration.

I am running a 3D object detection application on the nuScenes dataset, which contains 6,019 samples (around 30 GB in total). To reduce disk access, I added a 100 ms delay between handling each sample. However, the same timeout issue still occurs.

I have observed that the timeout issue does not occur when I run my application after executing the following command:

echo 3 > /proc/sys/vm/drop_caches
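
(Note that the redirection needs a root shell; from a normal user, an equivalent is:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Running sync first writes out dirty pages so that the caches can actually be dropped.)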

Could this observation assist you in diagnosing the issue?

I think it’s an issue in the kernel itself.
Do you hit the same thing on an x86 PC also running kernel 5.15?
Alternatively, you may try upgrading the kernel on your Jetson device, as we provide some flexibility in which kernel versions you can use in JetPack 6:
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/Kernel/BringYourOwnKernel.html

I’m not sure, but since dropping the caches forces the kernel to read the files from disk, maybe the I/O interrupts are then handled properly.
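
One way to check this theory (requires the sysstat package) is to watch whether reads actually hit the disk while the application runs:

iostat -x 1

If the r/s column for nvme0n1 stays near zero, the data is being served from the page cache rather than from the SSD.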

Or can you provide some sample code that consistently reproduces this issue for us to check?

I reproduced the issue with the CUDA-CenterPoint example.

As preparation, you need to create the 6,019 .bin files by running eval_nusc.py --dump.

I was able to reproduce the same situation when I ran the CenterPoint program twice (the first run fills the page cache).
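
Roughly, the sequence is as follows (the binary name is illustrative; use whatever the sample builds):

python eval_nusc.py --dump   # dump the 6,019 .bin input files
./centerpoint                # first run: fills the page cache
./centerpoint                # second run: the timeouts show up here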

OK, we will look into it.
Does it happen on all kinds of NVMe disks?

It would be great if you could also test this.

I only have a single NVMe SSD. I am using an SK hynix P31 2TB.
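
(For reference, the drive model and its health stats can be read with the nvme-cli package:

sudo nvme list                   # model, firmware, capacity
sudo nvme smart-log /dev/nvme0   # temperature, error counts, wear
)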

I don’t have an NVMe SSD in my Linux PC.
Since I don’t have a separate kernel build environment, I’ll attempt to upgrade the kernel when I have time.

After I upgraded the kernel from 5.15 to 6.6, the NVMe timeout and the cudaStreamSynchronize delay no longer occurred.

The kernel upgrade is easily done as long as the defconfig is properly set.
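
In outline, I followed the Bring Your Own Kernel flow from the page linked above (a rough sketch, built natively on the Orin; the Jetson-specific config options come from that document):

git clone --depth 1 --branch v6.6 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
make defconfig          # then apply the Jetson config fragments per the BYOK guide
make -j"$(nproc)" Image modules
sudo make modules_install
sudo cp arch/arm64/boot/Image /boot/Image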

Could you please let me know what has been fixed for this issue?

Thank you for your assistance.

OK, it’s great that you solved it with kernel 6.6.
If it was caused by the upstream kernel, then it’s not something solely controlled by NVIDIA.
You may search the Git repo for the code that throws the error and see if you get some clue.

After running the application many more times, the error still occurs with low probability
(around 1-2 times per 100,000 inputs).

I think the issue is not fully fixed in kernel 6.6.

Do I need to upgrade the kernel to a later version?

According to the “Upstream Patches” page, the latest kernel version shown in the patch list is 6.7. Does this mean 6.7 is the latest kernel I can try?

You can give kernel 6.7 a try.

I upgraded the kernel to 6.7. Currently, no error occurs.
I will monitor the program for a while to see whether the same error happens or not.

After upgrading from 6.6 to 6.7, my application runs 20% faster than before.
Are there any changes related to performance?

I still get the same error in kernel 6.7 as in kernel 6.6.

Does it happen on all kinds of NVMe disks?

I only have one NVMe disk.

Have you tried reproducing this error?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.