xusb crash with usb cameras on jetson TX1

With two Intel RealSense cameras and a single USB 2D camera on the USB bus, I can easily reproduce the following crash while starting up our system:

[  145.335287] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[  145.346662] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[  145.353938] mc-err:   status = 0x6000004a; addr = 0x00000000
[  145.359668] mc-err:   secure: no, access-type: read, SMMU fault: nr-nw-s
[  145.366483] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[  145.377880] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[  145.385156] mc-err:   status = 0x6000004a; addr = 0x00000000
[  145.390865] mc-err:   secure: no, access-type: read, SMMU fault: nr-nw-s
[  145.397651] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[  145.409028] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[  145.416295] mc-err:   status = 0x6000004a; addr = 0x00000000
[  145.422018] mc-err:   secure: no, access-type: read, SMMU fault: nr-nw-s
[  145.428839] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[  145.440199] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[  145.447484] mc-err:   status = 0x6000004a; addr = 0x00000000
[  145.453210] mc-err:   secure: no, access-type: read, SMMU fault: nr-nw-s
[  145.459989] mc-err: Too many MC errors; throttling prints
[  150.237341] tegra-xusb-mbox 70098000.mailbox: Controller firmware hang
[  150.244101] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_OWNER 0x0
[  150.251061] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_CMD 0x80000000
[  150.258478] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_DATA_IN 0x0
[  150.265585] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_DATA_OUT 0x0
[  155.381216] xhci-tegra 70090000.xusb: HC died; cleaning up
[  155.500053] uvcvideo: Failed to query (131) UVC probe control : -110 (exp. 34).
[  173.833378] xhci-tegra 70090000.xusb: Stopped the command ring failed, maybe the host is dead
[  173.878749] xhci-tegra 70090000.xusb: Abort command ring failed
[  173.884678] xhci-tegra 70090000.xusb: HC died; cleaning up

The system needs to be rebooted to restore usb connectivity. This crash happens with the new L4T 28.1 release and with the previous L4T 24.2.1. In the best case this seems to be a firmware error in the xusb controller. In the worst case it could be a hardware error in the controller.

Is this a known error and is someone working on it?
Are there any known workarounds?

This is an issue with the two xhci firmware versions:

Firmware timestamp: 2016-06-16 13:21:43 UTC, Version: 50.16 release

and

Firmware timestamp: 2016-11-24 02:31:08 UTC, Version: 50.18 release

(distributed with L4T 28.1)

Hi brmrbt,
To clarify: do you have two Intel RealSense cameras and one USB 2D camera connected to the default carrier board, so three USB cameras in total? What is the brand of the USB 2D camera?

The issue happens also with just the two realsense R200 cameras connected - perhaps a little less frequently than when we also have the 2D camera connected.

The crash happens very rarely when just enabling the depth stream from the two R200s, but if I enable the two infrared streams and the 2D stream from both cameras in addition to the depth stream, the error happens quickly. The error happens when starting our application. This crash happened after 6 starts:

[619544.138239] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=fffffffff2
[619544.150155] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[619544.157738] mc-err: status = 0x6000004a; addr = 0x00000000
[619544.163689] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[619544.170784] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=fffffffff2
[619544.182350] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[619544.189681] mc-err: status = 0x6000004a; addr = 0x00000000
[619544.195439] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[619544.202250] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=fffffffff2
[619544.213666] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[619544.220989] mc-err: status = 0x6000004a; addr = 0x00000000
[619544.226746] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[619544.233552] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=fffffffff2
[619544.244945] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[619544.252260] mc-err: status = 0x6000004a; addr = 0x00000000
[619544.258002] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[619544.264791] mc-err: Too many MC errors; throttling prints
[619549.039820] tegra-xusb-mbox 70098000.mailbox: Controller firmware hang
[619549.046674] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_OWNER 0x0
[619549.053630] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_CMD 0x80000000
[619549.061005] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_DATA_IN 0x0
[619549.068117] tegra-xusb-mbox 70098000.mailbox: XUSB_CFG_ARU_MBOX_DATA_OUT 0x0
[619550.866909] xhci-tegra 70090000.xusb: xHCI host not responding to stop endpoint command.
[619550.875279] xhci-tegra 70090000.xusb: Assuming host is dying, halting host.
[619550.947317] xhci-tegra 70090000.xusb: Host not halted after 16000 microseconds.
[619550.954954] xhci-tegra 70090000.xusb: Non-responsive xHCI host is not halting.
[619550.962341] xhci-tegra 70090000.xusb: Completing active URBs anyway.

This application is running on the latest 28.1 release with librealsense and ROS (Kinetic) on a Jetson evaluation board with an external USB 3.0 hub from TP-Link.

Can you explain what the smmu error means?

Hi brmbrt,
This is not a kernel crash; the application is trying to access an illegal memory address, which leads to the smmu errors. Please check the realsense SW stack.

After this crash all usb devices are dead including the ethernet controller.

I’m pretty sure that an application should not be able to do that with a null pointer exception.

My explanation for the smmu error is that the xusb controller accesses address 0 as a DMA target. This causes the xusb to hang in the request, so it crashes and brings down the kernel xhci driver.

This line gives the source of the fault:

mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry

csr_xusb_hostr here is the source of the fault, which as far as I can see is the xusb controller and not an application running on the CPU.
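
To make the point about the fault source concrete, the mc-err lines can be parsed mechanically. This is only a minimal Python sketch, assuming the log format matches the lines quoted in this thread. The function name `parse_mc_errors` and the regular expressions are illustrative and not part of any NVIDIA tool.

```python
import re

# Patterns follow the mc-err lines quoted in this thread (assumption:
# other kernel versions may format these messages differently).
CLIENT_RE = re.compile(r"mc-err: \((\d+)\) (\w+): (.+)")
STATUS_RE = re.compile(
    r"mc-err:\s+status = (0x[0-9a-fA-F]+); addr = (0x[0-9a-fA-F]+)"
)


def parse_mc_errors(log_text):
    """Return one dict per memory-controller fault found in log_text."""
    faults = []
    for line in log_text.splitlines():
        m = CLIENT_RE.search(line)
        if m:
            # First line of a fault report: numeric id, client name, reason.
            faults.append({
                "id": int(m.group(1)),
                "client": m.group(2),
                "reason": m.group(3).strip(),
                "status": None,
                "addr": None,
            })
            continue
        m = STATUS_RE.search(line)
        if m and faults and faults[-1]["status"] is None:
            # Status line that belongs to the most recent fault report.
            faults[-1]["status"] = int(m.group(1), 16)
            faults[-1]["addr"] = int(m.group(2), 16)
    return faults
```

Run on the log above, every parsed fault reports the client `csr_xusb_hostr` with a fault address of 0, which is what I would expect if the request comes from the xusb controller itself.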

Hi brmrbt, what function is running when the issue happens? Encoding, decoding, or some gpu processing?

We don’t have an R200 and want to know whether the realsense SW case can be simulated with general usb cameras.

Hi DaneLLL,

It is a start up issue.

The problem happens when we start streaming from two realsense cameras at the same time.
I tried the same test using 2 USB 2D cameras and that did not fail (I could not start three USB 2.0 cameras because of bandwidth limitations).

The crash is very rare if we just start one stream from each realsense camera (one in 5000 times) but becomes very frequent if several streams from each camera are started. The realsense cameras are special in several ways. They have multiple streams: a left and right infrared stream, a depth information stream and a separate 2D camera stream. They also have a USB 3.0 interface instead of the standard USB 2.0 camera interface.

I think to reproduce this issue you would need at least 3 USB 3 2D cameras. Is it possible to get some debug information out of the xusb controller? It is very easy for us to reproduce. Another alternative is that we send you a pair of R200s and I make a test program that causes the failure that you can run.
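
A test program for this could look roughly like the Python sketch below. It starts a command repeatedly and checks the kernel log for the failure messages quoted earlier in this thread. It makes several assumptions: the start command is a placeholder for whatever launches the camera streams, `dmesg` is readable by the current user, and the log contains none of these messages before the first start (for example after a fresh boot). The names `find_signatures` and `run_until_failure` are hypothetical.

```python
import subprocess

# Failure messages taken from the kernel logs quoted in this thread.
SIGNATURES = (
    "EMEM decode error",
    "Controller firmware hang",
    "not responding to stop endpoint command",
    "HC died",
)


def find_signatures(log_text):
    """Return the known failure messages that occur in log_text."""
    return [s for s in SIGNATURES if s in log_text]


def read_kernel_log():
    """Read the kernel ring buffer (may require root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return result.stdout


def run_until_failure(start_cmd, max_starts=100, timeout_s=30,
                      read_log=read_kernel_log):
    """Run start_cmd up to max_starts times.

    Returns (start_number, signatures) for the first start after which
    a failure message is in the log, or None if no failure was seen.
    """
    for n in range(1, max_starts + 1):
        try:
            subprocess.run(start_cmd, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            # The command ran for the whole window and was killed.
            pass
        hits = find_signatures(read_log())
        if hits:
            return n, hits
    return None
```

The `read_log` parameter is there so the loop can be tried out with canned log text before running it against real hardware.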

Hi brmbrt, we cannot receive your devices. For now we are discussing how to debug this further.

Hi brmbrt,
Please check if the firmware attached helps the issue.

/lib/firmware/tegra21x_xusb_firmware

After it is replaced, the firmware information should be

[   10.326547] xhci-tegra 70090000.xusb: Firmware timestamp: 2017-09-29 02:37:18 UTC, Version: 50.18 release

tegra21x_xusb_firmware.txt (124 KB)
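
To confirm that the replaced firmware was actually loaded, the timestamp and version can be read back from the boot log. A minimal Python sketch follows, assuming the message format matches the line quoted above. The name `firmware_info` and the regular expression are illustrative only.

```python
import re

# Matches the "Firmware timestamp" message quoted in this thread
# (assumption: the driver prints it in this form on your release).
FW_RE = re.compile(
    r"Firmware timestamp: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC, "
    r"Version: (\S+) release"
)


def firmware_info(log_text):
    """Return (timestamp, version) from the first match, or None."""
    m = FW_RE.search(log_text)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

The result can then be compared against the expected timestamp of the attached firmware, since the version string alone (50.18) does not distinguish it from the earlier 50.18 build.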

I have now had time to test with the new firmware and it seems to fail in the same way:

[ 6.736056] tegra-xhci tegra-xhci: Firmware timestamp: 2017-09-29 02:37:18 UTC, Version: 50.18 release, Falcon state 0x20

[ 3069.032122] smmu_dump_pagetable(): fault_address=0x0000000000000000 pa=0xffffffffffffffff bytes=ffffffffffffffff #pte=0 in L2
[ 3069.049210] mc-err: (20) csr_xusb_hostr: EMEM decode error on PDE or PTE entry
[ 3069.060769] mc-err: status = 0x6000004a; addr = 0x00000000
[ 3069.067793] mc-err: secure: no, access-type: read, SMMU fault: nr-nw-s
[ 3072.237743] tegra-xhci tegra-xhci: Firmware reinit.
[ 3074.389626] tegra-xhci tegra-xhci: xHCI host not responding to stop endpoint command.
[ 3074.389754] tegra-xhci tegra-xhci: Assuming host is dying, halting host.
[ 3074.504890] tegra-xhci tegra-xhci: Host not halted after 16000 microseconds.
[ 3074.504981] tegra-xhci tegra-xhci: Non-responsive xHCI host is not halting.
[ 3074.505052] tegra-xhci tegra-xhci: Completing active URBs anyway.
[ 3074.506254] uvcvideo: Failed to query (SET_CUR) UVC control 1 on unit 7: -110 (exp. 2).
[ 3074.506750] uvcvideo: Failed to query (SET_CUR) UVC control 1 on unit 7: -110 (exp. 2).
[ 3074.506780] tegra-xhci tegra-xhci: HC died; cleaning up
[ 3074.506971] tegra_xhci_hcd_reinit: hcd_reinit is disabled

This was with kernel 3.10.96. I tried to remove the xhci kernel module (and that went fine) but inserting it in the kernel again caused the system to freeze.

Hi brmrbt,
r28.1 should be kernel 4.4. Did you try on r28.1?

If you think it will make a difference that I run on r28.1 then I will do that. Our current production system runs on the previous release with the 3.10.96 kernel so I wanted to test there first.

Hi brmbrt,
We don’t have an Intel R200, so we need your help to provide more information.

Per your comment, the firmware doesn’t help on r24.2.1.

Hi, we are experiencing the same problem noted here (USB firmware controller crashes if we try to use a USB 3.0 camera) and in a more recent thread at https://devtalk.nvidia.com/default/topic/1029050/-quot-xhci-host-not-responding-to-stop-endpoint-command-quot-when-attempting-to-receive-frames-from-usb3-camera/. I’ve posted the details to the more recent thread in hopes we can continue to debug this.