Orin USB problems

Dear NVIDIA team,

We are using an Orin Developer Kit and have encountered a problem with a USB serial device. The serial connection runs fine for a while and then gets stuck somewhere in the Linux kernel.

Here are the key points that I’ve identified:

  1. The device works consistently on an x64 host. The problem is reproducible on Orin only.
  2. The device works fine on Orin for some time. However, sooner or later it is very likely to enter the stuck state.
  3. The device runs in parallel with a USB3 Vision camera. When the issue manifests itself, the camera is also affected (its FPS drops temporarily).
  4. The device is plugged directly into a dedicated USB port, with no hubs in between. The camera is also plugged into its own dedicated USB port. The device does not have its own power supply, so it is power-cycled on every unplug.
  5. If you unplug the device and plug it back in, Orin is unable to even enumerate it. The enumeration issue persists until Orin is rebooted.
  6. If you attach a logic analyzer to the USB bus after the enumeration issue occurs, you won’t see any traffic: only regular SOFs are present on the bus. This persists even after the device is unplugged: no USB reset and no SETUP packets are sent. Meanwhile, Orin complains about a descriptor read error, which isn’t surprising because the descriptors were never requested.
  7. If you attach a logic analyzer to the USB bus before reconnecting the device (so the enumeration issue has not occurred yet), you won’t see any data traffic either. Orin polls the device with IN packets but does not send any OUT packets, even when explicitly requested with echo 123 > /dev/ttyACMx.
  8. The echo process gets stuck in a syscall:
     [<0>] __switch_to+0xc8/0x120
     [<0>] usb_start_wait_urb+0x94/0x100
     [<0>] usb_control_msg+0xc4/0x140
     [<0>] 0xffffac2b0b82dfac
     [<0>] 0xffffac2b0b82ec24
     [<0>] tty_port_block_til_ready+0x1e0/0x320
     [<0>] tty_port_open+0xcc/0x110
     [<0>] 0xffffac2b0b82d924
     [<0>] tty_open+0x130/0x530
     [<0>] chrdev_open+0xac/0x1b0
     [<0>] do_dentry_open+0x134/0x3a0
     [<0>] vfs_open+0x3c/0x50
     [<0>] path_openat+0x858/0xde0
     [<0>] do_filp_open+0x88/0x110
     [<0>] do_sys_openat2+0x1fc/0x2b0
     [<0>] do_sys_open+0x80/0xd0
     [<0>] __arm64_sys_openat+0x30/0x40
     [<0>] el0_svc_common.constprop.0+0x80/0x1d0
     [<0>] do_el0_svc+0x38/0xb0
     [<0>] el0_svc+0x1c/0x30
     [<0>] el0_sync_handler+0xa8/0xb0
     [<0>] el0_sync+0x16c/0x180
  9. No specific dmesg messages are produced when the issue manifests itself. All dmesg messages are post-mortem, i.e. complaints about descriptor read failures. Without any interaction, the device can remain in the stuck state indefinitely with no messages reported.
  10. The USB port effectively becomes dead: other devices are not enumerated either.
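
For reference, a trace like the one in point 8 can be collected for every blocked task at once. The following is a minimal Python sketch, assuming a kernel that exposes /proc/<pid>/stack (this needs root and stack trace support in the kernel). The helper names task_state and blocked_tasks are made up for illustration, and it is an assumption that the stuck process sits in uninterruptible sleep (state D).

```python
import os

def task_state(stat_text):
    """Return the one-letter task state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    return stat_text.rsplit(")", 1)[1].split()[0]

def blocked_tasks(proc_root="/proc"):
    """Yield (pid, kernel_stack_text) for tasks in uninterruptible sleep (D)."""
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if task_state(f.read()) != "D":
                    continue
            with open(os.path.join(proc_root, entry, "stack")) as f:
                yield int(entry), f.read()
        except OSError:
            # The task exited meanwhile, or the stack is not readable
            # (for example when not running as root).
            continue
```

Run as root, something like `for pid, stack in blocked_tasks(): print(pid, stack, sep="\n")` prints one trace per blocked task.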

I understand that you will probably not be able to reproduce this issue yourselves, because it is likely specific to our particular setup. However, I can collect more detailed dumps and debug info if you need them.

I noticed that other folks on this forum are having similar issues:

Hi makkarpov,

Do you mean it would stay in the stuck state and could not recover?

Have you tried to reproduce on other platforms (like Xavier NX, AGX Xavier …etc)?

We have a known issue with USB serial in JetPack 5.1 (R35.2.1) for Orin.
Could you also help verify this in the next release when it is available?

> Do you mean it would stay in the stuck state and could not recover?

We waited for quite a while and never saw the port recover on its own. However, it does recover in some cases after voodoo manipulations like unplugging everything and plugging it back in.
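
To check for recovery without hanging the shell, the write from point 7 can be run in a child process with a timeout. This is a minimal Python sketch, assuming a Linux host. The name probe_write and the device path are placeholders, and a child that is blocked inside the kernel may ignore the kill until the kernel call returns.

```python
import subprocess
import sys

# Child program: open the given path for writing and send a few bytes,
# roughly what `echo 123 > /dev/ttyACMx` does.
_CHILD = (
    "import os, sys\n"
    "fd = os.open(sys.argv[1], os.O_WRONLY)\n"
    "os.write(fd, b'123\\n')\n"
    "os.close(fd)\n"
)

def probe_write(path, timeout_s=5.0):
    """Return 'ok', 'error' or 'hung' for one write attempt on `path`."""
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Do not wait for the child again: a task stuck in the kernel
        # may not die until the blocking call returns.
        child.kill()
        return "hung"
    return "ok" if rc == 0 else "error"
```

For example, `probe_write("/dev/ttyACM0")` (the device node name is an assumption) could be called once a minute, logging when the result changes from "hung" back to "ok".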

> Have you tried to reproduce on other platforms (like Xavier NX, AGX Xavier …etc)?

Unfortunately, this would not be easy to test. The device works together with the camera analysis system, and Xavier cannot run that system in real time due to resource constraints, so reproducing the setup on Xavier would require writing a synthetic test program.

Also, I was able to reproduce the issue with a custom class USB device and libusb1. The overall symptoms are the same: all traffic stops after a short time (10-30 minutes of system operation). A logic analyzer again shows nothing but SOFs. From the libusb point of view, the device abruptly stops sending and receiving data. Restarting the process does not help. The device can stay in this state for tens of minutes at least.

Some things are different, however:

  1. The device has a hub in between. All other hub ports continue to work.
  2. Replugging the device restores the operation.

The custom class device is much more stable than CDC serial. However, it is still far from reliable in absolute terms.
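
To pin down when the traffic stops (so that the moment can be matched against the logic analyzer capture and dmesg), the libusb application can timestamp its last completed transfer. This is a minimal Python sketch that is not tied to any libusb binding. StallDetector and its methods are hypothetical names, and the application is assumed to call transfer_completed() from its own transfer-completion path.

```python
import time

class StallDetector:
    """Report a stall when no transfer has completed for `timeout_s` seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested
        self.last_ok = clock()      # time of the last completed transfer

    def transfer_completed(self):
        """Call whenever an IN or OUT transfer finishes successfully."""
        self.last_ok = self.clock()

    def idle_time(self):
        """Seconds elapsed since the last completed transfer."""
        return self.clock() - self.last_ok

    def stalled(self):
        """True once the idle time exceeds the configured timeout."""
        return self.idle_time() > self.timeout_s
```

The timeout has to be longer than the longest idle gap the device produces in normal operation, otherwise quiet periods are reported as stalls.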