Jetson TX2 fails to flash; Ubuntu 16; JetPack 3.1, 3.2

These are the last lines of “dmesg --follow”:
[10532.499293] usb-storage 1-2:1.0: USB Mass Storage device detected
[10532.499425] scsi host2: usb-storage 1-2:1.0
[10533.527164] scsi 2:0:0:0: Direct-Access kingston DT101 G2 PMAP PQ: 0 ANSI: 4
[10533.527502] sd 2:0:0:0: Attached scsi generic sg0 type 0
[10535.582495] sd 2:0:0:0: [sda] 15577088 512-byte logical blocks: (7.98 GB/7.43 GiB)
[10535.582708] sd 2:0:0:0: [sda] Write Protect is off
[10535.582710] sd 2:0:0:0: [sda] Mode Sense: 23 00 00 00
[10535.582904] sd 2:0:0:0: [sda] No Caching mode page found
[10535.582906] sd 2:0:0:0: [sda] Assuming drive cache: write through

[10535.617455] sda: sda1
[10535.619093] sd 2:0:0:0: [sda] Attached SCSI removable disk
[10536.038062] FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

I put the whole output of “dmesg --follow” in a file; if it is helpful I will copy it here (it seems I cannot attach a file here). The two bold lines are shown in red.

It is odd to see “sda” for the USB thumb drive. Does your laptop have entirely NVMe and no regular hard drive? This would explain it. I’ll assume “sda” really is the thumb drive.

Sort of off topic: I don’t know what you use that thumb drive for, but it wasn’t unmounted before it was removed (in Windows this would be “eject safely”…an icon in the system tray is used for that). Windows would “repair” the file system on that thumb drive when connecting…on Linux you could run “sudo fsck.vfat /dev/sda1” (the FAT partition, not the whole disk) to do the equivalent. Since you’re not mounting this for any kind of write it won’t matter, but did you get any “dmesg” output at all while running this?

cat /dev/sda > /dev/null

What you see at the moment the USB stick is connected to the host isn’t particularly useful, other than for identifying the device you’re testing with. The real trick is to check that the stick is listed as “480M” on the right side of its line in “lsusb -t”, then run “cat /dev/sda > /dev/null” while monitoring dmesg, and see whether a “dmesg” error appears or “lsusb -t” reverts from “480M” to something like “1.5M” or “12M”.
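
The speed check described above can be scripted. Here is a minimal sketch in Python, assuming the speed is the last comma-separated field of an “lsusb -t” line, as in the listings in this thread. The function names and the “12M” sample line are made up for illustration; in practice you would capture “lsusb -t” before and during the “cat /dev/sda > /dev/null” run and compare the two captures.

```python
import re

# Trailing speed field of an `lsusb -t` line, e.g. "..., 480M" or "..., 1.5M".
SPEED_RE = re.compile(r",\s*(\d+(?:\.\d+)?)M\s*$")

def speed_mbps(line):
    """Return the speed in Mbit/s reported at the end of the line, or None."""
    match = SPEED_RE.search(line)
    return float(match.group(1)) if match else None

def speed_dropped(before, after):
    """True only if both lines carry a speed and the second one is lower."""
    b, a = speed_mbps(before), speed_mbps(after)
    return b is not None and a is not None and a < b

# Real line from this thread, captured before the read test.
before = "Port 2: Dev 25, If 0, Class=Mass Storage, Driver=usb-storage, 480M"
# Hypothetical line: what a fallback to a slower mode might look like.
fallback = "Port 2: Dev 25, If 0, Class=Mass Storage, Driver=usb-storage, 12M"

print(speed_dropped(before, before))    # compares the line with itself
print(speed_dropped(before, fallback))  # compares 480M against 12M
```

This only looks at the text “lsusb -t” prints; it says nothing about why a port might renegotiate its speed.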

Thanks linuxdev for the step by step guidance.
I ran fsck to repair the thumb drive, and now there is no unmount error. Otherwise the output is almost the same.

With “lsusb -t”, the thumb disk line:
Port 2: Dev 25, If 0, Class=Mass Storage, Driver=usb-storage, 480M
never changes.

With “dmesg --follow”, the thumb disk lines:
[47224.799984] usb 1-2: New USB device found, idVendor=0000, idProduct=0000
[47224.799988] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[47224.799990] usb 1-2: Product: DT101 G2
[47224.799992] usb 1-2: Manufacturer: kingston
[47224.799994] usb 1-2: SerialNumber: 90C60A0026500C52
[47224.800413] usb-storage 1-2:1.0: USB Mass Storage device detected
[47224.800525] scsi host5: usb-storage 1-2:1.0
[47225.829662] scsi 5:0:0:0: Direct-Access kingston DT101 G2 PMAP PQ: 0 ANSI: 4
[47225.831284] sd 5:0:0:0: Attached scsi generic sg0 type 0
[47228.033126] sd 5:0:0:0: [sda] 15577088 512-byte logical blocks: (7.98 GB/7.43 GiB)
[47228.033293] sd 5:0:0:0: [sda] Write Protect is off
[47228.033296] sd 5:0:0:0: [sda] Mode Sense: 23 00 00 00
[47228.033489] sd 5:0:0:0: [sda] No Caching mode page found
[47228.033491] sd 5:0:0:0: [sda] Assuming drive cache: write through
[47228.067160] sda: sda1
[47228.068907] sd 5:0:0:0: [sda] Attached SCSI removable disk

With “cat /dev/sda > /dev/null”:
There is never any output, and running this doesn’t cause any change in the two windows above. It seems my USB port doesn’t drop from 480M to a lower speed.

I am curious…you have two root_hub listings, the other port is USB3 compatible. Nothing is connected to this other port. Is that something you could try, or is this port not accessible? I’m also wondering, during a flash, is the Jetson the only device on the current HUB? I am thinking the USB3 port could have better wiring so far as signal quality goes, and also that interactions of multiple devices on a single root_hub could be getting in the way. Anything to move the Jetson to its own port would be a good debug step, as well as moving the Jetson to a different port.

I have a Type-C USB port, which should be USB3. I will try it tomorrow. I seem to remember trying that before, with no success. I don’t have a USB2 micro-B to Type-C cable, so I use a Type-C to standard-A converter.

If you use a C to standard A and that connects to a HUB, then the original micro-B to A should work, though normally I tell people to try with no HUB (and then if it fails try with a HUB). USB3 will of course have to drop back to USB2 (which is how it would normally work), and I’m hoping perhaps differences in signal quality to the other root_hub will help.

I figured out that all my USB ports are USB 3. I checked the USB specs for my laptop: one Thunderbolt 3 (USB Type-C) port (USB 3.1 Gen 2), one USB 3.0 port, and one USB 3.0 port with PowerShare.

I tested with the Type-C USB port, and the result is still the same. The “lsusb -t” shows:
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 0955:7c18 NVIDIA Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 005: ID 0c45:670c Microdia
Bus 001 Device 004: ID 04f3:2234 Elan Microelectronics Corp.
Bus 001 Device 003: ID 0cf3:e301 Qualcomm Atheros Communications
Bus 001 Device 002: ID 046d:c52f Logitech, Inc. Unifying Receiver
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

I think Bus 003/004 belong to my Type-C hub, and the TX2 is on Bus 003. My sense is that the USB port is not the cause.
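
To answer the earlier question of whether the Jetson is alone on its bus, the listing above can be checked mechanically. The following is a rough Python sketch, assuming the one-device-per-line format shown above (which looks like plain “lsusb” output rather than the “-t” tree). The function names are invented for this example, and root hubs are recognised only by the words “root hub” in the description.

```python
import re

# One line of the listing: "Bus 003 Device 002: ID 0955:7c18 NVIDIA Corp."
LINE_RE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\s*(.*)$"
)

# Listing copied from the post above.
LISTING = """\
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 002: ID 0955:7c18 NVIDIA Corp.
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 005: ID 0c45:670c Microdia
Bus 001 Device 004: ID 04f3:2234 Elan Microelectronics Corp.
Bus 001 Device 003: ID 0cf3:e301 Qualcomm Atheros Communications
Bus 001 Device 002: ID 046d:c52f Logitech, Inc. Unifying Receiver
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
"""

def parse_lsusb(text):
    """Turn the listing into a list of dicts, skipping lines that do not match."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            bus, dev, vendor, product, name = match.groups()
            devices.append({"bus": int(bus), "dev": int(dev),
                            "vendor": vendor.lower(), "product": product.lower(),
                            "name": name})
    return devices

def bus_sharers(devices, vendor):
    """Non-root-hub devices on the same bus(es) as the given vendor ID."""
    buses = {d["bus"] for d in devices if d["vendor"] == vendor}
    return [d for d in devices
            if d["bus"] in buses
            and d["vendor"] != vendor
            and "root hub" not in d["name"]]

devices = parse_lsusb(LISTING)
# 0955 is the vendor ID on the "NVIDIA Corp." line above.
print([d["bus"] for d in devices if d["vendor"] == "0955"])
print(bus_sharers(devices, "0955"))
```

An empty result from bus_sharers would mean nothing else is listed on the Jetson’s bus in this particular capture.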

I have two concerns:

  1. Installation of JetPack: I have a problem installing CUDA. The install always fails when it reaches CUDA, so I removed CUDA from the installation. I think this should not cause the flash problem.
  2. My laptop has a shutdown problem: sometimes on shutdown or restart (especially when a monitor or Ethernet is connected through the Type-C USB port), the system gets stuck at unmounting the file system.

Here are some of the error, fail, or red lines from dmesg:
[ 9.225037] EXT4-fs (nvme0n1p3): re-mounted. Opts: errors=remount-ro
[ 9.325951] int3403 thermal: probe of INT3403:03 failed with error -22
[ 9.300348] intel_hid: module verification failed: signature and/or required key missing - tainting kernel
[ 14.352484] ath10k_pci 0000:3a:00.0: Direct firmware load for ath10k/cal-pci-0000:3a:00.0.bin failed with error -2
[ 14.352710] ath10k_pci 0000:3a:00.0: Direct firmware load for ath10k/QCA6174/hw3.0/firmware-5.bin failed with error -2
[ 14.352713] ath10k_pci 0000:3a:00.0: could not fetch firmware file ‘ath10k/QCA6174/hw3.0/firmware-5.bin’: -2
[ 56.074838] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074839] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074840] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074840] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074842] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074844] mce: [Hardware Error]: Machine check events logged
[ 56.074846] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 56.074848] mce: [Hardware Error]: Machine check events logged

[ 56.076873] CPU3: Core temperature/speed normal
[ 56.076874] CPU1: Core temperature/speed normal
[ 56.076874] CPU0: Package temperature/speed normal
[ 56.076875] CPU2: Package temperature/speed normal
[ 56.076876] CPU1: Package temperature/speed normal
[ 56.076879] CPU3: Package temperature/speed normal

The previous “lsusb -t” disagrees about it all being USB3…the port everything has been on is not USB3:

/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M

…480M is USB2.

Now it is possible the port is capable of USB3, but using a USB2 driver. If that port is using the right driver and is truly capable of USB3, then it means the port has itself determined there is a signal quality issue and is refusing to operate at USB3 speeds. If you are absolutely certain that port really is USB3, then you’ve just found evidence that the host software has decided the port itself has a signal quality issue and has dropped back to slower operation.
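
For reference, the speed figures at the end of “lsusb -t” lines correspond to the nominal USB signalling rates. Here is a small Python sketch of that mapping, assuming the speed is the last comma-separated field as in the root_hub line quoted above. The table name, function name, and the “5000M” sample line are illustrative only.

```python
# Nominal USB signalling rates in Mbit/s and the mode each one corresponds to.
USB_SPEEDS = {
    1.5: "low speed",
    12.0: "full speed",
    480.0: "USB 2.0 high speed",
    5000.0: "USB 3.x SuperSpeed",
}

def classify(line):
    """Name the USB mode for the speed at the end of an `lsusb -t` line."""
    field = line.rsplit(",", 1)[-1].strip()
    if not field.endswith("M"):
        return "unknown"
    try:
        mbps = float(field[:-1])
    except ValueError:
        return "unknown"
    return USB_SPEEDS.get(mbps, "unknown")

# Root hub line quoted in this thread.
root_hub = "/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M"
# Hypothetical line for a root hub running at USB3 speed.
usb3_hub = "/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M"

print(classify(root_hub))
print(classify(usb3_hub))
```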

I didn’t see Bus 3 or 4 listed. Is there an “lsusb -t” reading showing those buses? You mentioned thunderbolt, but this is not USB so I would be very surprised (and call it a bug) if thunderbolt showed up in “lsusb”.

You are correct that CUDA has no impact on the flash stage. All flash is done prior to any additional software installs, and the Jetson will have rebooted itself if flash had occurred. On the other hand CUDA is mandatory for all of the other optional packages…without CUDA you can’t install other CUDA-dependent software.

If there was reason for the CPUs to be hot, then it isn’t an error for throttling back. If the CPU cores get hot when they should not, then perhaps there is an issue (e.g., sometimes old thermal paste no longer transfers heat correctly or a fan dies or dust covers parts and insulates them). I’m not sure if this would have any effect on the flash or not.

I see WiFi involved, and so far as networking goes you can’t normally use WiFi between host and Jetson. It is ok to have WiFi between host and the internet, but the Jetson needs wired ethernet for extra software install steps. Flash has no network requirement at all and won’t care.

This might be of interest:

[ 56.074844] mce: [Hardware Error]: Machine check events logged

There is no way to guess at what errors this refers to. There is probably a way to turn up kernel logging details and get more information, but I’m not sure what to do to get that log information. It could be a note about USB hardware errors, I don’t know. Earlier Atheros 10k firmware failure could be causing this, but probably not since the ath10_firmware load failure occurred long before the MCE (machine check exception). There is some information on MCE here, though it probably won’t help you much:
https://en.wikipedia.org/wiki/Machine-check_exception

Overheating is a big cause of MCE failures, e.g., a PC with an inadequately cooled CPU and GPU suddenly being stressed by gaming and heating faster than the CPU can throttle back would be a common example. Your overheat message and MCE occurred together, and odds are heating caused the MCE. It is enough of a notice that I wouldn’t trust the computer.
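
The “occurred together” observation can be checked against the dmesg timestamps. The sketch below is only an illustration in Python: it assumes the “[ seconds.micros] message” layout shown in the dmesg excerpts above, and the 0.01 second window is an arbitrary choice, not a kernel-defined value. It shows that events are close in time; it cannot prove that one caused the other.

```python
import re

# dmesg line layout as pasted in this thread: "[   56.074844] message"
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# A few lines copied from the dmesg excerpt earlier in the thread.
SAMPLE = [
    "[ 14.352713] ath10k_pci 0000:3a:00.0: could not fetch firmware file",
    "[ 56.074838] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "[ 56.074844] mce: [Hardware Error]: Machine check events logged",
    "[ 56.074848] mce: [Hardware Error]: Machine check events logged",
]

def parse(line):
    """Return (timestamp_seconds, message), or None if the line has no timestamp."""
    match = TS_RE.match(line)
    return (float(match.group(1)), match.group(2)) if match else None

def mce_near_throttle(lines, window=0.01):
    """For each MCE line, report whether a throttle line lies within `window` seconds."""
    events = [e for e in map(parse, lines) if e is not None]
    throttles = [t for t, msg in events if "temperature above threshold" in msg]
    mces = [t for t, msg in events if msg.startswith("mce:")]
    return [(t, any(abs(t - s) <= window for s in throttles)) for t in mces]

for timestamp, near in mce_near_throttle(SAMPLE):
    print(timestamp, "near a throttle event" if near else "no throttle event nearby")
```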

Hello linuxdev, I don’t know how to thank you for helping me in so much detail.

Bus 003 and Bus 004 are the bold lines in my last post.

The following are the red lines from “journalctl -b0” (the log file has almost 6000 lines and it seems I cannot attach a file here; I suspect the bold lines may be causing problems):

Jul 15 15:17:12 wt70707xps systemd-udevd[319]: invalid key/value pair in file /etc/udev/rules.d/dji-usb.rules on line 3, starting at character 1 (‘$’)
Jul 15 15:17:12 wt70707xps systemd-udevd[319]: invalid key/value pair in file /etc/udev/rules.d/dji-usb.rules on line 5, starting at character 1 (‘r’)

Jul 15 15:17:12 wt70707xps systemd-udevd[397]: Error running install command for pinctrl_intel

Jul 15 15:17:12 wt70707xps systemd-udevd[389]: Error running install command for pinctrl_intel

Jul 15 15:17:13 wt70707xps smartd[1008]: Problem creating device name scan list
Jul 15 15:17:13 wt70707xps smartd[1008]: In the system’s table of devices NO devices found to scan

Jul 15 15:17:13 wt70707xps NetworkManager[931]: nm_device_get_device_type: assertion ‘NM_IS_DEVICE (self)’ failed

Jul 15 15:17:13 wt70707xps bluetoothd[955]: Failed to obtain handles for “Service Changed” characteristic
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Error adding Link Loss service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Current Time Service could not be registered
Jul 15 15:17:13 wt70707xps bluetoothd[955]: gatt-time-server: Input/output error (5)
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Not enough free handles to register service
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Sap driver initialization failed.
Jul 15 15:17:13 wt70707xps bluetoothd[955]: sap-server: Operation not permitted (1)
Jul 15 15:17:13 wt70707xps bluetoothd[955]: Failed to set mode: Blocked through rfkill (0x12)

Jul 15 15:17:14 wt70707xps lightdm[1452]: PAM unable to dlopen(pam_kwallet.so): /lib/security/pam_kwallet.so: cannot open shared object file: No such file or direct
Jul 15 15:17:14 wt70707xps lightdm[1452]: PAM adding faulty module: pam_kwallet.so
Jul 15 15:17:14 wt70707xps lightdm[1452]: PAM unable to dlopen(pam_kwallet5.so): /lib/security/pam_kwallet5.so: cannot open shared object file: No such file or dire
Jul 15 15:17:14 wt70707xps lightdm[1452]: PAM adding faulty module: pam_kwallet5.so

Jul 15 15:17:14 wt70707xps lightdm[1531]: PAM unable to dlopen(pam_kwallet.so): /lib/security/pam_kwallet.so: cannot open shared object file: No such file or direct
Jul 15 15:17:14 wt70707xps lightdm[1531]: PAM adding faulty module: pam_kwallet.so
Jul 15 15:17:14 wt70707xps lightdm[1531]: PAM unable to dlopen(pam_kwallet5.so): /lib/security/pam_kwallet5.so: cannot open shared object file: No such file or dire
Jul 15 15:17:14 wt70707xps lightdm[1531]: PAM adding faulty module: pam_kwallet5.so

Jul 15 15:17:15 wt70707xps systemd[1]: Failed to start User Manager for UID 0.
Jul 15 15:17:15 wt70707xps su[1664]: pam_systemd(su:session): Failed to create session: Start job for unit user@0.service failed with ‘failed’

Jul 15 15:17:15 wt70707xps pulseaudio[1619]: [pulseaudio] backend-ofono.c: Failed to register as a handsfree audio agent with ofono: org.freedesktop.DBus.Error.Serv

Jul 15 15:17:17 wt70707xps kernel: ath10k_pci 0000:3a:00.0: could not fetch firmware file ‘ath10k/QCA6174/hw3.0/firmware-5.bin’: -2

Jul 15 15:17:22 wt70707xps wpa_supplicant[2040]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Jul 15 15:17:22 wt70707xps wpa_supplicant[2040]: dbus: Failed to construct signal
Jul 15 15:17:22 wt70707xps wpa_supplicant[2040]: Could not read interface p2p-dev-wlp58s0 flags: No such device

Jul 15 15:17:42 wt70707xps bluetoothd[955]: RFCOMM server failed for Headset Voice gateway: rfcomm_bind: Address already in use (98)
Jul 15 15:17:42 wt70707xps pulseaudio[2591]: [pulseaudio] backend-ofono.c: Failed to register as a handsfree audio agent with ofono: org.freedesktop.DBus.Error.Serv
Jul 15 15:18:14 wt70707xps kernel: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 15 15:18:14 wt70707xps kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 15 15:18:14 wt70707xps kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 15 15:18:14 wt70707xps kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 15 15:18:14 wt70707xps kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 15 15:18:14 wt70707xps kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)

journalctl.txt (729 KB)

You can attach files. The trick is that the post must already exist…the attach is sort of like an edit. If you hover your mouse over the top right “quote” icon of your post you’ll see a paper clip icon show up. Click on the paper clip icon to attach a file. Some restrictions exist on file names, so for example you might rename a log with “.txt”. Pasting is also ok, but you’ll want to use the “code” icon (looks like “</>”) to give it scroll bars and preserve indentation.

The bluetooth stuff won’t have any effect on wired ethernet, nor on the USB port unless there is some defect and the bluetooth is directly wired to the USB port. Could be, but not likely.

The “kwallet” (seems you are running KDE) is a background password tool for assisting in such things as opening up chromium-browser and having it sync to google without a million passwords. This won’t have any effect on flashing, and for those things which do depend on this, then it’ll just mean it manually asks for a password. It isn’t a “show stopper”.

What is curious is what might be causing the temperature rise. Bad firmware and driver state could cause drawing more power, but more often than not this is a hardware issue instead of a software issue. Don’t know how old your laptop is, but at about 4 or 5 years some heat sink compounds harden and lose the ability to transfer heat. I’ve seen this before and these days I won’t use any compound other than a “long life” one. There is also a carbon fiber heat transfer pad I find interesting, but I have not yet tried that out.

This still does not answer the question as to whether the port you have been using is intended to be USB3 or not. If this port really is USB3 and not USB2 (and if the USB3 driver is in place), then it is a “smoking gun” evidence something is wrong with signal quality and the hardware is in some way failing.

I got my laptop last September, so it is still under warranty. It is a Dell XPS 13 9360.

I guess the standard Type-A USB 3.0 ports are USB 2.0 compatible, and there is a USB 3.0 controller (Bus 002) and a USB 2.0 controller (Bus 001). This is why there are two root hubs even when no USB hub is plugged in. Bus 003/004 are added when I plug in a USB Type-C hub, which also has a USB 3.x controller and a USB 2.0 controller.
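
The pairing of buses described here can be read straight from the root hub lines of the earlier listing. This is a small Python sketch under the assumption that the IDs 1d6b:0002 and 1d6b:0003 mark the 2.0 and 3.0 root hubs, as the descriptions in that listing say. The names ROOT_HUB_IDS and root_hubs are made up for the example.

```python
import re

# IDs and labels taken from the root hub lines in the listing posted earlier.
ROOT_HUB_IDS = {"1d6b:0002": "USB 2.0", "1d6b:0003": "USB 3.0"}

ROOT_HUB_LINES = """\
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
"""

def root_hubs(text):
    """Map each bus number to the USB generation of its root hub."""
    hubs = {}
    for line in text.splitlines():
        match = re.match(r"^Bus (\d+) Device \d+: ID ([0-9a-f]{4}:[0-9a-f]{4})",
                         line.strip())
        if match and match.group(2) in ROOT_HUB_IDS:
            hubs[int(match.group(1))] = ROOT_HUB_IDS[match.group(2)]
    return hubs

for bus, generation in sorted(root_hubs(ROOT_HUB_LINES).items()):
    print(f"Bus {bus:03d}: {generation} root hub")
```

Each 2.0/3.0 pair in the result lines up with the bus pairs described above (001/002 and 003/004).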

The journalctl log file is attached to the last post.

I have no way to spot the exact cause, but it does seem that other systems will flash this exact Jetson. Having issues where you get hardware errors which are vague, but hinting at overheating, makes me wonder what is going on. This could be your Ubuntu install, but you mentioned using someone else’s laptop and that laptop working. Are they the same model of laptop? Are they both running Ubuntu 16.04?

If you really really wanted to know what is going on you’d need to put a USB2 protocol analyzer on the line. There are some software-only debug tools, but those tools won’t tell you anything about what the PHY is doing (and I think the issue revolves around signal quality/PHY).

I had some issues when I tried to flash my Xavier. It showed up as a USB device in lsusb and everything looked OK, but I couldn’t flash. Then I changed to another USB port on my host computer (laptop): one with a regular USB symbol, not “SS” + USB symbol. This solved my headache!
BR Andreas