Xavier becomes unresponsive and reboots

Hi, I am using the NUSCENES devkit (https://github.com/nutonomy/nuscenes-devkit) on my Xavier, together with the SECOND (https://github.com/traveller59/second.pytorch) package, to train on the NUSCENES dataset stored on an external SSD.

I have attached an external SSD to my Jetson Xavier, and when I try to access the SSD the Xavier often becomes unresponsive and reboots. After a while this happens very frequently.

The Xavier reboots in the middle of training and I am forced to start over each time.

I have tried re-flashing the Xavier, with no luck.

The SSD is a “SAMSUNG VNAND SSD 860 EVO 500GB”

Any solution?

There is a reasonable chance that you are running out of memory. However, you will want to run a serial console and post the log of what goes on just before and during the failure.

For serial console it is just the first serial device which shows up upon plugging the micro-B USB into a host PC (e.g., run “dmesg --follow” on the host, and then plug in the micro-B USB to the Xavier…the first serial device name will look something like “/dev/ttyUSB0”). I like gtkterm (on the host, “sudo apt-get install gtkterm”), and if the device is ttyUSB0, then this would start a connection to serial console:

gtkterm -b 8 -t 1 -s 115200 -p /dev/ttyUSB0

(you would have to use “sudo” if your user is not a member of group “dialout”)
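As a rough sketch of the two checks described above (which serial device showed up, and whether “sudo” will be needed), something like the following Python could be run on the host. The helper names user_in_group and serial_candidates are made up for illustration; the “/dev/ttyUSB*” pattern and the “dialout” group come from the post, and other distributions may name them differently.

```python
import glob
import grp
import os
import pwd


def user_in_group(user, group):
    """Return True if `user` belongs to `group` (primary or supplementary)."""
    try:
        group_entry = grp.getgrnam(group)
    except KeyError:
        return False  # the group does not exist on this system
    if user in group_entry.gr_mem:
        return True
    try:
        # A user's primary group is not listed in gr_mem, so check it separately.
        return pwd.getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # unknown user


def serial_candidates(pattern="/dev/ttyUSB*"):
    """List device nodes matching the pattern, sorted by name."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Assumption: the login name is available in the USER environment variable.
    user = os.environ.get("USER", "")
    print("serial devices:", serial_candidates() or "none found")
    if not user_in_group(user, "dialout"):
        print("not in 'dialout'; gtkterm will likely need sudo")
```

Running it once before and once after plugging in the micro-B USB cable should show which device name is new.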

Incidentally, you could run “dmesg --follow” on the serial console itself so this would display logs up to the point of failure. The logs occurring as the program starts (or during failure) would be of use.
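If the serial console turns out to be awkward, one alternative is to stream the same “dmesg --follow” output into a file and force every line to disk, so that the last lines before a sudden reboot have a better chance of surviving. This is only a sketch: the function names and the log file name are hypothetical, and on some systems dmesg needs root.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path`, committing it to disk before the next one."""
    count = 0
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to commit it to storage
            count += 1
    return count


def follow_dmesg(path):
    """Stream `dmesg --follow` into `path` until interrupted."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    try:
        persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling follow_dmesg("dmesg_follow.log") from a directory on the Xavier's internal storage (not on the SSD under test) would keep the log on a device that is not the one failing.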

Currently I have 27%/8 GB free space; could this be an issue? And if it does turn out to be a memory issue, is it possible to extend the memory? I cannot free up any space, because all the files stored here are related to my current project.

By “the serial console itself” do you mean on the Xavier or on the machine connected to the Xavier via micro-USB?

[   15.070834] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   98.022244] fuse init (API version 7.26)
[  868.496148] sd 2:0:0:0: [sda] tag#27 uas_eh_abort_handler 0 uas-tag 28 inflight: CMD OUT
[  868.496180] sd 2:0:0:0: [sda] tag#27 CDB: opcode=0x2a 2a 00 13 72 b1 c0 00 00 10 00
[  869.509111] sd 2:0:0:0: [sda] tag#26 uas_eh_abort_handler 0 uas-tag 27 inflight: CMD OUT
[  869.509160] sd 2:0:0:0: [sda] tag#26 CDB: opcode=0x2a 2a 00 13 71 b7 a8 00 00 08 00
[  870.521890] sd 2:0:0:0: [sda] tag#25 uas_eh_abort_handler 0 uas-tag 26 inflight: CMD OUT
[  870.521920] sd 2:0:0:0: [sda] tag#25 CDB: opcode=0x2a 2a 00 13 71 14 a0 00 00 08 00
[  871.534725] sd 2:0:0:0: [sda] tag#24 uas_eh_abort_handler 0 uas-tag 25 inflight: CMD OUT
[  871.534756] sd 2:0:0:0: [sda] tag#24 CDB: opcode=0x2a 2a 00 13 70 7b 70 00 00 50 00
[  872.547631] sd 2:0:0:0: [sda] tag#23 uas_eh_abort_handler 0 uas-tag 24 inflight: CMD OUT
[  872.547658] sd 2:0:0:0: [sda] tag#23 CDB: opcode=0x2a 2a 00 13 6f ed f8 00 00 20 00
[  873.560432] sd 2:0:0:0: [sda] tag#22 uas_eh_abort_handler 0 uas-tag 23 inflight: CMD OUT
[  873.560465] sd 2:0:0:0: [sda] tag#22 CDB: opcode=0x2a 2a 00 13 6f 62 78 00 00 18 00
[  874.573320] sd 2:0:0:0: [sda] tag#21 uas_eh_abort_handler 0 uas-tag 22 inflight: CMD OUT
[  874.573349] sd 2:0:0:0: [sda] tag#21 CDB: opcode=0x2a 2a 00 13 6e 7e 68 00 00 f8 00
[  875.586170] sd 2:0:0:0: [sda] tag#20 uas_eh_abort_handler 0 uas-tag 21 inflight: CMD OUT
[  875.586194] sd 2:0:0:0: [sda] tag#20 CDB: opcode=0x2a 2a 00 13 6c 8e a0 00 00 08 00
[  876.598914] sd 2:0:0:0: [sda] tag#19 uas_eh_abort_handler 0 uas-tag 20 inflight: CMD OUT
[  876.598939] sd 2:0:0:0: [sda] tag#19 CDB: opcode=0x2a 2a 00 13 64 11 c8 00 00 18 00
[  877.611711] sd 2:0:0:0: [sda] tag#18 uas_eh_abort_handler 0 uas-tag 19 inflight: CMD OUT
[  877.611733] sd 2:0:0:0: [sda] tag#18 CDB: opcode=0x2a 2a 00 13 63 3f a0 00 00 98 00
[  878.624477] sd 2:0:0:0: [sda] tag#17 uas_eh_abort_handler 0 uas-tag 18 inflight: CMD OUT
[  878.624504] sd 2:0:0:0: [sda] tag#17 CDB: opcode=0x2a 2a 00 13 61 01 e8 00 00 08 00
[  879.637169] sd 2:0:0:0: [sda] tag#16 uas_eh_abort_handler 0 uas-tag 17 inflight: CMD OUT
[  879.637196] sd 2:0:0:0: [sda] tag#16 CDB: opcode=0x2a 2a 00 13 5c 89 c0 00 00 40 00
[  880.649882] sd 2:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 16 inflight: CMD OUT
[  880.649922] sd 2:0:0:0: [sda] tag#3 CDB: opcode=0x2a 2a 00 13 5b 29 30 00 00 08 00
[  881.662658] sd 2:0:0:0: [sda] tag#2 uas_eh_abort_handler 0 uas-tag 15 inflight: CMD OUT
[  881.662702] sd 2:0:0:0: [sda] tag#2 CDB: opcode=0x2a 2a 00 13 53 0d 10 00 00 08 00
[  882.675433] sd 2:0:0:0: [sda] tag#1 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[  882.675462] sd 2:0:0:0: [sda] tag#1 CDB: opcode=0x2a 2a 00 13 50 f4 00 00 00 10 00
[  883.688129] sd 2:0:0:0: [sda] tag#15 uas_eh_abort_handler 0 uas-tag 14 inflight: CMD OUT
[  883.688194] sd 2:0:0:0: [sda] tag#15 CDB: opcode=0x2a 2a 00 13 45 90 00 00 00 08 00
[  884.700836] sd 2:0:0:0: [sda] tag#14 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD OUT
[  884.700862] sd 2:0:0:0: [sda] tag#14 CDB: opcode=0x2a 2a 00 13 45 0c e8 00 00 08 00
[  885.713550] sd 2:0:0:0: [sda] tag#13 uas_eh_abort_handler 0 uas-tag 13 inflight: CMD OUT
[  885.713572] sd 2:0:0:0: [sda] tag#13 CDB: opcode=0x2a 2a 00 13 41 ae 00 00 00 18 00
[  886.726250] sd 2:0:0:0: [sda] tag#12 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT
[  886.726283] sd 2:0:0:0: [sda] tag#12 CDB: opcode=0x2a 2a 00 13 3d f1 18 00 00 08 00
[  887.739035] sd 2:0:0:0: [sda] tag#11 uas_eh_abort_handler 0 uas-tag 12 inflight: CMD OUT
[  887.739057] sd 2:0:0:0: [sda] tag#11 CDB: opcode=0x2a 2a 00 13 3d 65 38 00 00 10 00
[  888.751853] sd 2:0:0:0: [sda] tag#10 uas_eh_abort_handler 0 uas-tag 11 inflight: CMD OUT
[  888.751885] sd 2:0:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 13 3c 98 78 00 00 08 00
[  889.764593] sd 2:0:0:0: [sda] tag#9 uas_eh_abort_handler 0 uas-tag 10 inflight: CMD OUT
[  889.764626] sd 2:0:0:0: [sda] tag#9 CDB: opcode=0x2a 2a 00 13 3b db e8 00 00 08 00
[  890.777482] sd 2:0:0:0: [sda] tag#8 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT
[  890.777557] sd 2:0:0:0: [sda] tag#8 CDB: opcode=0x2a 2a 00 13 33 94 d0 00 00 08 00
[  891.790233] sd 2:0:0:0: [sda] tag#7 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT
[  891.790262] sd 2:0:0:0: [sda] tag#7 CDB: opcode=0x2a 2a 00 13 28 0b 08 00 00 08 00
[  892.802989] sd 2:0:0:0: [sda] tag#6 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT
[  892.803017] sd 2:0:0:0: [sda] tag#6 CDB: opcode=0x2a 2a 00 13 26 e5 88 00 00 08 00
[  893.815652] sd 2:0:0:0: [sda] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[  893.815729] sd 2:0:0:0: [sda] tag#5 CDB: opcode=0x2a 2a 00 13 26 65 50 00 00 08 00
[  894.828364] sd 2:0:0:0: [sda] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT
[  894.828453] sd 2:0:0:0: [sda] tag#4 CDB: opcode=0x2a 2a 00 13 25 e5 50 00 00 18 00
[  895.841204] sd 2:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
[  895.841233] sd 2:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 13 0b 05 90 00 00 08 00
[  896.853967] scsi host2: uas_eh_bus_reset_handler start
[  897.866666] usb 2-1: cmd cmplt err -2
[  898.879598] usb 2-1: cmd cmplt err -2
[  899.892649] usb 2-1: cmd cmplt err -2
[  900.905643] usb 2-1: cmd cmplt err -2
[  901.918513] usb 2-1: cmd cmplt err -2
[  902.931409] usb 2-1: cmd cmplt err -2
[  903.944258] usb 2-1: cmd cmplt err -2
[  904.957188] usb 2-1: cmd cmplt err -2
[  905.970182] usb 2-1: cmd cmplt err -2
[  906.983077] usb 2-1: cmd cmplt err -2
[  907.995897] usb 2-1: cmd cmplt err -2
[  909.008732] usb 2-1: cmd cmplt err -2
[  910.021586] usb 2-1: cmd cmplt err -2
[  911.034410] usb 2-1: cmd cmplt err -2
[  912.047264] usb 2-1: cmd cmplt err -2
[  913.060237] usb 2-1: cmd cmplt err -2
[  914.073113] usb 2-1: cmd cmplt err -2
[  915.086041] usb 2-1: cmd cmplt err -2
[  916.098837] usb 2-1: cmd cmplt err -2
[  917.111629] usb 2-1: cmd cmplt err -2
[  918.124513] usb 2-1: cmd cmplt err -2
[  919.137391] usb 2-1: cmd cmplt err -2
[  920.150276] usb 2-1: cmd cmplt err -2
[  921.163173] usb 2-1: cmd cmplt err -2
[  922.176001] usb 2-1: cmd cmplt err -2
[  929.397375] usb 2-1: Disable of device-initiated U1 failed.
[  935.541040] usb 2-1: Disable of device-initiated U2 failed.
[  936.673008] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  936.880396] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  937.088155] usb 2-1: device not accepting address 2, error -22
[  938.248478] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  938.456381] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  938.664086] usb 2-1: device not accepting address 2, error -22
[  938.812410] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  939.020271] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  939.228026] usb 2-1: device not accepting address 2, error -22
[  939.376364] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  939.584215] tegra-xusb 3610000.xhci: ERROR: unexpected setup address command completion code 0x7.
[  939.791989] usb 2-1: device not accepting address 2, error -22
[  939.848116] usb 2-1: USB disconnect, device number 2
packet_write_wait: Connection to 10.0.0.25 port 22: Broken pipe
tangos@vostro-dev:~$

This is the last output of

dmesg --follow

during training, just before the Xavier reboots. Note that this was captured over SSH.

The gtkterm command did not output anything; it only shows a blank, dark window.

Serial console implies the software running on the host PC (which is working as the display for the embedded system the software talks to over serial UART). There should be some obvious boot messages as the system boots, but it is perfectly reasonable that there is no output during the error (disappointing, but not uncommon…it simply means that if the console is working, then the reboot cause is so sudden that no logging can occur).

I see a lot of serious USB errors, also SATA errors. This leads to these questions:

  • Is the SSD connected directly, or is the disk using a USB external drive housing?
  • Does the NUSCENES devkit use a custom carrier board, or is it just software?
    • If this uses a custom carrier board, did you use the board support package the carrier board comes with?
  • Are any external USB or disk drive devices (excluding keyboard/mouse) using their own power, or are they drawing power from the Xavier?

  • The SSD is connected directly via a type-C port.
  • The NUSCENES devkit is just Python software; it does not use or require any custom carrier board.
  • The SSD has a 3.0 connector with a couple of lights on it (I assume that means it is supposed to draw power from the Xavier). Could that be the cause?

The errors only say that USB-C and the SSD connected over that USB-C are having serious problems. The cause could be power delivery. If the SSD is not self-powered (that is, it draws power from the Xavier) and you happen to have an externally powered USB HUB, then connecting the SSD through that HUB would rule power delivery in or out as the issue, since the HUB draws from a different source than the Xavier.

Can you confirm if there are problems in the case of the SSD being independently powered via an external power source? Note that there could have been data corruption on the SSD from previous issues, but what the logs were showing were not file system errors; instead those were USB and SATA errors.
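To make the “CDB” lines in the earlier log a little more readable, here is a small decoding sketch. It assumes the standard 10-byte SCSI layout (opcode in byte 0, logical block address in bytes 2-5, transfer length in bytes 7-8, all big-endian); the function name decode_cdb10 and the two-entry opcode table are just for illustration.

```python
import re

# Names for the two most common 10-byte transfer opcodes in the SCSI block
# command set; anything else is reported as a raw hex value.
OPCODES_10 = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

CDB_RE = re.compile(r"CDB: opcode=0x[0-9a-f]{2}((?: [0-9a-f]{2})+)")


def decode_cdb10(line):
    """Decode a 10-byte CDB from a kernel log line, or return None."""
    match = CDB_RE.search(line)
    if match is None:
        return None
    cdb = bytes.fromhex(match.group(1).replace(" ", ""))
    if len(cdb) != 10:
        return None  # not a 10-byte command; this sketch does not handle it
    return {
        "command": OPCODES_10.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks
    }
```

Applied to the first CDB line of the log (“2a 00 13 72 b1 c0 00 00 10 00”), this gives a WRITE(10) of 16 blocks starting at LBA 0x1372b1c0. Every aborted command in that log carries opcode 0x2a, so the aborts all happened on writes to the SSD, which fits these being transport-level errors rather than file system errors.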

It seems like the SSD was the issue. I copied the data from the SSD to a normal HDD and it works fine without rebooting. I don’t know if the problem is specific to SSD drives when connected to the Xavier, or if the one I have is at fault. I will format it and re-add the data to see if the problem persists.

If power delivery was ever an issue, then data would be suspect. External power would still be a good test.

Noted, thanks. I will update as soon as I retest with the SSD.

I’ve tested an external power supply and the problem still persists… I have run out of ideas on how to fix this issue. I have used my external drives on other devices and they worked well without any hassle or power surges. I have now come to the conclusion that the Xavier is the culprit.

From what I can see in previous posts, USB to the external drive is giving errors. You have tested external power to the drive, so it is unlikely that power delivery is the issue. Using “dmesg --follow”, can you verify that with external power to the drive you still get those same USB errors? I’m guessing you do, but I want to verify.

One other possibility is running out of physical RAM. This would not necessarily mean USB is not an issue, but it could cause a sudden reboot or other failures. You may want to monitor RAM use and see what it looks like at the moment of failure. I’ll suggest installing “htop” (“sudo apt-get install htop”), and monitoring that via serial console. At the moment of failure you should be able to see some information on memory use.
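If watching htop live is not practical, a complementary approach is to log memory figures to a file at a fixed interval and read the last entries after the reboot. The sketch below assumes a standard Linux /proc/meminfo with the “MemAvailable” and “SwapFree” fields; the function names, the one-second interval, and the output format are arbitrary choices for illustration.

```python
import os
import time


def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def log_memory(path, interval_s=1.0, samples=None, source="/proc/meminfo"):
    """Append timestamped memory samples to `path`; run forever if samples is None."""
    taken = 0
    with open(path, "a", encoding="utf-8") as log:
        while samples is None or taken < samples:
            with open(source, encoding="utf-8") as f:
                info = parse_meminfo(f.read())
            log.write(
                f"{time.time():.1f} "
                f"MemAvailable={info.get('MemAvailable', -1)}kB "
                f"SwapFree={info.get('SwapFree', -1)}kB\n"
            )
            log.flush()
            os.fsync(log.fileno())  # commit each sample before the next one
            taken += 1
            if samples is None or taken < samples:
                time.sleep(interval_s)
    return taken
```

Started in a second terminal before training, for example log_memory("mem.log"), this leaves a trail showing whether available memory was collapsing in the seconds before the reboot. As with any log here, the file should live on internal storage rather than on the SSD being tested.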

If this does not indicate anything new, then someone else will need to find out the reason for the USB errors. One clue which might help is knowing if the actual error changes any for a directly connected USB SSD, versus indirectly attached via the powered HUB.
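For that comparison, one way to see whether the error actually changes is to reduce each log line to a “signature” with the volatile parts removed (timestamps, tag numbers, port paths, CDB bytes) and then compare the two sets. This is only a sketch under my own assumptions about which fields are volatile; the patterns are based on the log posted earlier in this thread, and the function names are hypothetical.

```python
import re
from collections import Counter

# (pattern, replacement) pairs for fields expected to differ between runs.
# Error codes such as "err -2" or "code 0x7" are deliberately kept.
VOLATILE = [
    (re.compile(r"^\[\s*\d+\.\d+\]\s*"), ""),                # timestamp
    (re.compile(r"\busb \d+-\d+(?:\.\d+)*"), "usb <port>"),   # bus-port path
    (re.compile(r"\bsd \d+:\d+:\d+:\d+"), "sd <hctl>"),       # SCSI address
    (re.compile(r"\bscsi host\d+"), "scsi host<n>"),
    (re.compile(r"tag#\d+"), "tag#<n>"),
    (re.compile(r"uas-tag \d+"), "uas-tag <n>"),
    (re.compile(r"(opcode=0x[0-9a-f]{2})(?: [0-9a-f]{2})+"), r"\1 <cdb>"),
    (re.compile(r"\b(address|device number) \d+"), r"\1 <n>"),
]


def signature(line):
    """Reduce one kernel log line to its message shape."""
    text = line.strip()
    for pattern, replacement in VOLATILE:
        text = pattern.sub(replacement, text)
    return text


def summarize(lines):
    """Count how often each signature occurs."""
    return Counter(signature(line) for line in lines if line.strip())


def compare(direct_lines, hub_lines):
    """Report which signatures appear in only one capture, and which in both."""
    direct, hub = summarize(direct_lines), summarize(hub_lines)
    return {
        "only_direct": sorted(set(direct) - set(hub)),
        "only_hub": sorted(set(hub) - set(direct)),
        "common": sorted(set(direct) & set(hub)),
    }
```

Feeding it one saved “dmesg --follow” capture for the directly connected SSD and one for the SSD behind the powered HUB would show at a glance whether the same error types appear in both setups.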