I have been using the NVIDIA Jetson AGX Orin Developer Kit with JetPack 5.1 and had no issues running my application with an NVMe SSD attached to a PCIe slot via an NVMe PCIe expansion card.
However, after upgrading the board to JetPack 6.0 by flashing the eMMC, my TensorRT program hangs during inference (roughly 1–2 times per 1,000 inputs).
Each hang coincides with the following message in the dmesg log:

nvme nvme0: I/O XXX QID X timeout, completion polled

Inference stalls for around 30 seconds, and the next inference proceeds only after this message is printed in dmesg.
Given that my application accesses NVMe storage heavily, I suspect the issue is blocked NVMe access, which stalls TensorRT inference and delays cudaStreamSynchronize.
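To check this suspicion, one simple approach is to sample the kernel's per-device I/O counters while inference runs; a 30-second NVMe stall would show up as an interval in which the counters stop advancing. A minimal sketch, assuming the SSD shows up as nvme0n1 (check lsblk for the actual name):

```shell
# Print cumulative completed reads/writes for the NVMe device from
# /proc/diskstats ($3 = device name, $4 = reads completed, $8 = writes
# completed). Run this in a loop alongside inference; a flat interval
# in the counters indicates stalled I/O.
# "nvme0n1" is an assumed device name -- substitute yours.
awk '$3 == "nvme0n1" { print $3, "reads:", $4, "writes:", $8 }' /proc/diskstats
```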
Are there any known NVMe-related issues after the Linux kernel upgrade from JetPack 5.1 to JetPack 6.0?
Does it also happen when you connect the SSD directly to the M.2 slot on the board?
I don’t see why you need a PCIe expansion card when you have direct access to the M.2 slot.
The screw on the M.2 slot of my AGX Orin board could not be loosened, so I cannot use that slot. That’s why I connected the SSD through a PCIe expansion card.
The file below is the full dmesg log. Each line containing “nvme nvme0: I/O XXX QID X timeout, completion polled” is printed at the moment a delayed cudaStreamSynchronize finishes.
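For correlating the two, the timeout lines (with their kernel timestamps) can be filtered out of a saved dmesg capture; “dmesg.log” below is an assumed filename for the attached log:

```shell
# Pull the NVMe timeout lines out of a saved dmesg capture so their
# kernel timestamps can be matched against the moments the delayed
# cudaStreamSynchronize calls return.
# "dmesg.log" is an assumed filename for the captured log.
grep -n 'nvme nvme0: I/O .* QID .* timeout, completion polled' dmesg.log 2>/dev/null \
    || echo "no timeout lines found (or dmesg.log missing)"
```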
This happens because the kernel missed the completion interrupt from the NVMe device, so the driver only noticed the finished I/O when it polled the completion queue at the timeout.
Can you try reducing disk access, or adjust nvpmodel and run “sudo jetson_clocks” so the CPU runs faster?
I am using MAXN mode and have already applied “sudo jetson_clocks.” The timeout still occurs with that configuration.
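For reference, both settings can be verified on the board itself. A small sketch using the standard JetPack tools, guarded so it is a no-op on machines where they are not installed:

```shell
# Verify the performance configuration discussed above.
# nvpmodel and jetson_clocks exist only on Jetson boards, so each
# command is guarded; on other machines this prints nothing.
command -v nvpmodel >/dev/null 2>&1 && sudo nvpmodel -q          # reports current power mode (expect MAXN)
command -v jetson_clocks >/dev/null 2>&1 && sudo jetson_clocks --show  # reports current clock settings
true  # keep the exit status clean when the tools are absent
```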
I am running a 3D object detection application on the nuScenes dataset, which contains 6,019 samples (about 30 GB in total). To reduce disk access, I added a 100 ms delay between samples, but the same timeout still occurs.
I have observed that the timeout does not occur when I run my application after executing the following command:
echo 3 > /proc/sys/vm/drop_caches
Could this observation assist you in diagnosing the issue?
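A note on that workaround: writing 3 to drop_caches discards the page cache plus dentries and inodes, and the kernel documentation recommends running sync first so no dirty pages are sitting in the cache when it is dropped. A minimal sketch (needs root, so it is guarded here):

```shell
# Flush dirty pages, then drop the page cache, dentries and inodes --
# the same "echo 3" workaround described above, preceded by a sync.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
else
    echo "run as root (or via sudo tee) to drop caches" >&2
fi
```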
I only have a single NVMe SSD: an SK Hynix P31 2TB.
I don’t have an NVMe SSD in my Linux PC.
Since I don’t have a kernel build environment set up, I’ll try upgrading the kernel when I have time.
OK, it’s great that you solved it with kernel 6.6.
If it’s caused by the upstream kernel, then it’s not solely under NVIDIA’s control.
You could search the Git repository for the code that prints the error and see if that gives you a clue.
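As a starting point, the message text can be searched for directly in a kernel source tree; it should come from the NVMe host driver, so the search can be scoped there. A sketch, where KSRC is an assumed path to your kernel checkout:

```shell
# Locate the code that prints "timeout, completion polled" in a kernel
# source tree. KSRC is an assumed path to the kernel checkout; the
# message is expected to live under drivers/nvme (the host driver).
KSRC=${KSRC:-linux}
grep -rn "completion polled" "$KSRC/drivers/nvme" 2>/dev/null \
    || echo "no match -- check that KSRC points at a kernel source tree"
```

The surrounding function in the driver then shows the conditions under which the message is emitted, which is the clue to chase in the upstream Git history.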