Unplugging a USB flash causes tegra-capture-vi: uncorr_err: request timed out after 2500 ms

We are using an AGX Xavier with L4T version R35.5.0.
We have 3 external sensors connected to the AGX Xavier over the CSI interface.

Under normal circumstances, the 3 sensors can produce images normally. However, when we insert a USB flash drive, copy a large file (3GB or more) to the USB flash drive, and we unplug the USB flash drive while the file is still in the process of being copied, all 3 VIs report the following error:

[ 2845.152403] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 2845.152415] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 2845.152503] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms

Theoretically there is no correlation between VI and USB, so why would USB affect VI?
The probability of this problem recurring is nearly a percentage, and the following is the kernel printout at the time of the problem:
Please help us, thanks!

2.txt (18.8 KB)

In addition, we found that the same error (tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms) is also reported when system calls are made too frequently.
Our guess is that the kernel threads are not being scheduled in time.

I’m the wrong guy to answer, but I’ll add some preliminary information that might be related.

First, never unplug your USB drive without first using umount (or one of the “eject” functions). You’re damaging the filesystem on the USB device, which needs correction at the next plug-in. Correction involves losing data.

Next, it is only the first CPU core which handles many hardware devices. This would include the external device in most cases, and the camera sensors. When operation is “normal” the behavior is defined. As soon as you hit an undefined behavior of something that shouldn’t be done, there is a possibility that the response is also undefined, and ends up as an error. In your case the reason this might matter is you are tying up CPU0, which is also handling your sensors. Until that error is resolved (such as through timing out) it is quite possible that the sensors cannot preempt the error handler, and the error spreads.
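One way to check whether interrupt handling really does pile up on CPU0 on a given system is to total the per-CPU counters in /proc/interrupts. The following is a minimal sketch, not something from the original post: the function name is made up for illustration, and it only assumes the usual Linux layout of that file (a header row of CPU names, then one row per interrupt source).

```python
def per_cpu_interrupt_totals(text):
    """Sum the per-CPU interrupt counts from /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is the IRQ label (e.g. "42:"); the per-CPU counts follow.
        counts = fields[1:1 + len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue                     # summary rows such as "Err:" hold one count
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return dict(zip(cpus, totals))
```

Calling it with the contents of /proc/interrupts before and after reproducing the problem, and comparing the two results, would show which cores took the extra interrupts.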

So the short answer is that you are probably right about the kernel threads not being scheduled in time. Even if you have RT kernel patch this is not hard realtime hardware, and since it all converges on CPU0, the answer is probably “don’t do that” (removing a memory device while in use without safe eject or umount).

Thank you very much for your reply! We have benefited greatly from your response.
We also feel that we shouldn’t unplug the USB storage before umount, but the customer has this application scenario, so we had to troubleshoot.
Do you have any troubleshooting ideas for this issue? We need to troubleshoot to a point where we can convince the customer.

Unplugging the memory device will always cause problems, and will lose content on the memory device. It is also probably true that this should not crash the system since the filesystem of the o/s is not on the device.

The “best” scenario: NVIDIA or someone working with the source code of your camera software would code this to fail gracefully. This camera software would still fail, and the program would need to be restarted, and data would be lost on the memory device. However, the system would otherwise continue as up and running. Maybe @KevinFFF or @WayneWWW could comment on who maintains the software producing this debug output:

[ 2855.072552] tegra194-vi5 15c10000.vi: vi_capture_release: release channel IVC failed
[ 2855.072736] ------------[ cut here ]------------
[ 2855.072848] WARNING: CPU: 3 PID: 146483 at /home/gitlab-runner/builds/x2eF_KiB/0/csw/jetson/l4t/r3550/kernel/nvidia/drivers/media/platform/tegra/camera/fusa-capture/capture-vi.c:972 vi_capture_release+0x2b4/0x2e0

The only purpose of that would be to have a graceful failure rather than to not fail (you cannot make this not fail when pulling an SD card out).

You could redesign this. AGX Xavier has eMMC which the customer cannot just pull out from under the system. It is very likely the removable media is used because the size of the data far exceeds the eMMC size. What you could do is to have the camera record to the eMMC, up to some smaller size, and have a second thread/program which then periodically copies (or appends) this to SD card (or thumb drive or whatever the external media is); then start overwriting that eMMC file to reuse that space. The scheme is to record smaller amounts from the camera to a location that cannot be suddenly removed, then move or append that data elsewhere via another program. That second program can fail gracefully, since its failure does not bring down the system.

As an example, suppose your camera software starts out saving to “/tmp/username/data.0”. At 1 MB of data, it then switches to recording to “/tmp/username/data.1”. When that reaches 1 MB, it switches to recording to “/tmp/username/data.2”. And so on, until it finishes recording to “/tmp/username/data.9”. After this it starts over recording at “/tmp/username/data.0” (there is a small amount of time to copy small blocks). Each time any file is written to, a lock is created; whenever the lock is released the file is appended to SD card, and then the tmp file is deleted (basically a ring buffer). There are all kinds of schemes for this, and variations. The gist is that if you have removable media and your customer is not going to follow a better procedure, then you cannot write to the removable media using a program that brings the system down.
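A rough sketch of the rotating-file scheme above, in Python. Everything here is illustrative: ChunkRing and drain are hypothetical names, the per-file lock is replaced by a simple list of full slots, and writer and drainer run in the same thread. A real version would put the drainer in a separate thread or process with proper locking.

```python
import os


class ChunkRing:
    """Stage recorded data in fixed-size chunk files data.0 .. data.N-1."""

    def __init__(self, staging_dir, slots=10, chunk_bytes=1024 * 1024):
        self.staging_dir = staging_dir
        self.slots = slots
        self.chunk_bytes = chunk_bytes
        self.current = 0   # slot currently being written
        self.ready = []    # full slots waiting to be drained, oldest first
        os.makedirs(staging_dir, exist_ok=True)

    def slot_path(self, index):
        return os.path.join(self.staging_dir, f"data.{index}")

    def write(self, data):
        """Append data to the current slot and rotate once it is full."""
        if self.current in self.ready:
            raise RuntimeError("ring is full: the drain is not keeping up")
        path = self.slot_path(self.current)
        with open(path, "ab") as f:
            f.write(data)
        if os.path.getsize(path) >= self.chunk_bytes:
            self.ready.append(self.current)
            self.current = (self.current + 1) % self.slots


def drain(ring, dest_path):
    """Append each full chunk to dest_path, then delete the staged copy.

    Returns False if the destination cannot be written (for example the
    media was removed). The chunk stays staged so it can be retried later.
    A partially appended chunk is not rolled back in this sketch.
    """
    while ring.ready:
        src = ring.slot_path(ring.ready[0])
        try:
            with open(src, "rb") as fin, open(dest_path, "ab") as fout:
                fout.write(fin.read())
                fout.flush()
                os.fsync(fout.fileno())
        except OSError:
            return False
        os.remove(src)
        ring.ready.pop(0)
    return True
```

The point of the split is that a failure on the removable media only makes drain return False; the capture side keeps writing to local storage until the ring fills up.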

Or if this has custom hardware, perhaps you could provide some alternative “easy” way to safely eject the device after turning off the camera capture. The method with the ring buffer would probably be a performance failure if this is a high resolution camera. I say this because you would now be reading and writing simultaneously: the camera writes to eMMC while another process reads a “/tmp/username/data.*” file and writes to the removable media (appending a file chunk on a regular basis).

The simple fix is for your customer to stop the video program and properly umount the SD card (this can be made more convenient) before pulling it. This is also the only way to avoid damaging the SD card’s filesystem and losing data.
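As one way the umount step could be made more convenient, a small helper can be wired to a button or menu item. This is only a sketch: safe_eject is a hypothetical name, the mount point is whatever your system uses, it assumes the capture program has already been stopped, and it needs enough privilege to run umount.

```python
import subprocess


def safe_eject(mount_point, run=subprocess.run):
    """Flush pending writes, then unmount mount_point.

    Returns True if the media is safe to remove, or False if a command
    failed (for example because a file on the device is still open).
    The run parameter exists only so the helper can be exercised without
    actually unmounting anything.
    """
    try:
        run(["sync"], check=True)
        run(["umount", mount_point], check=True)
    except subprocess.CalledProcessError:
        return False
    return True
```

If it returns False, the user should be told not to pull the device yet.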

Incidentally, a variation on all of this is to create a couple of ramdisks. Swap often does this since it avoids writing to the solid state memory (which eventually wears it out). However, you can only do that if you have sufficient RAM. You could pre-create 10 ramdisks and substitute those for the “/tmp/username/data.*”. 10 MB of RAM isn’t too bad, but if you format that ramdisk as ext4, then there is overhead and it won’t be a full MB on each chunk/block. Performance though would be superior to writing to disk at first; this would be beneficial on average throughput since it is a buffer. Latency goes up no matter what you do with a swap scheme, but ramdisk impact would be minimal.
