AGX Xavier dev kit faulty USB-C cannot flash

I will try the 3p server thing as soon as I can.

The dmesg output does show something on the J512 port when the device is put into recovery mode.

How can I talk to the 3p server to flash the device and bypass the faulty j512 port? Is this even possible?

Is the J512 you are speaking of the USB connector on the right side of the 40-pin header? If your host PC does not see anything when the Jetson is in recovery mode, upon cable insert, then there is a hardware issue. The exception would be a VM or container being misconfigured. If this is straight to another Linux computer and plugging that in does not show up on “dmesg --follow”, then we can call that hardware failure.
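If reading “dmesg --follow” by eye is awkward, a second check is to look for an NVIDIA entry in the “lsusb” output on the host. Below is a minimal sketch of that idea, assuming the usual “Bus … Device …: ID vvvv:pppp …” line format of lsusb and NVIDIA's USB vendor ID of 0955. The function name and the sample text are made up for illustration, and the product ID and description in the sample are only an example, not taken from this thread.

```python
import re

NVIDIA_VENDOR_ID = "0955"  # NVIDIA's USB vendor ID

# One lsusb line looks like: "Bus 001 Device 012: ID 0955:7019 Some description"
LSUSB_LINE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\s*(.*)$"
)


def find_nvidia_devices(lsusb_text):
    """Return (bus, device, product_id, description) for each NVIDIA entry."""
    found = []
    for line in lsusb_text.splitlines():
        match = LSUSB_LINE.match(line.strip())
        if match is None:
            continue  # not a device line
        bus, device, vendor, product, description = match.groups()
        if vendor.lower() == NVIDIA_VENDOR_ID:
            found.append((bus, device, product.lower(), description.strip()))
    return found


# Illustrative sample only; real text would come from running "lsusb" on the host.
SAMPLE_LSUSB = """\
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 012: ID 0955:7019 NVIDIA Corp. APX
Bus 001 Device 002: ID 8087:0a2b Intel Corp.
"""

print(find_nvidia_devices(SAMPLE_LSUSB))
```

To use it against the real host, pass in the text from something like subprocess.run(["lsusb"], capture_output=True, text=True).stdout. An empty list with the Jetson in recovery mode and the cable inserted would point the same direction as a silent dmesg.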

The 3p server is a bit of an undocumented black box which NVIDIA has not released. It is what runs when the Jetson is in recovery mode. You don’t talk to it per se, but the flash software does. That talking is via the USB connector on the right of the 40-pin header. Any time the Jetson is in recovery mode the 3p server is available.

Yes it’s the one marked ‘flash/debug’, and it is the one that seems to be the issue. The image I sent previously confirms that the host PC does register activity when the device is connected in recovery mode.

Is it almost certain, based on everything so far, that it is a hardware issue? How much of the hardware could be at fault besides the port?

The above would suggest the hardware works. The port itself is not defective in this case, and there is no bypass for flashing. Any dmesg activity means the PC does see the Jetson in recovery mode. Detecting a recovery mode Jetson via dmesg is, by itself, the same as verifying that the USB part of this functions. USB does not know, nor does it care, what happens with the data it transfers (this is up to the hardware and software the USB talks to at each end).

Are you sure there is a broken solder connection? Is there any kind of VM or container being used with the flash software? Is absolutely all flash content running natively on the host PC directly in Ubuntu?

I do not know if there is a broken connection, the traces are so tiny I cannot tell.

But you have my pinky promise that there is no VM or container running and no shenanigans beyond the OS version tweak (which was not an issue for other identical devices). Flashing is attempted either by a natively installed SDK manager or by the script as.

The thread is getting long so forgive me if I restate some things, but double check what I’m summarizing. Here is what we actually know (sorry for this long summary, hoping it helps to show history and logs in one place; maybe @DavidDDD or someone else who knows more about the “tegrarcm_v2 --isapplet” mentioned near the end can answer if this is really a hardware failure; numbering bullet points):

  1. USB itself is verified. There is dmesg activity upon insert of USB cable from Jetson to host PC when the Jetson is in recovery mode.
  2. Host PC side flash software is not detecting the Jetson, suggesting (in conflict with the first statement) that USB does not work. In a log:
[   2.1157 ] Applet version 01.00.0000
[   3.1236 ] tegrarcm_v2 --isapplet
[   3.1242 ] USB communication failed.Check if device is in recovery
[   3.3738 ] tegrarcm_v2 --ismb2

This latter log means the flash software cannot talk to the 3p server, or settings are so wrong that the USB is being ignored on the assumption that the hardware is not the hardware specified for flash (USB can be functioning correctly if the message appears simply because the software wants different hardware; I cannot differentiate between a hardware failure and a hardware identification problem).

  3. Boot log from serial console indicates the unit probably worked at some point, or at least some flashed content was put on the unit. However, lots of errors are visible in serial console boot log. This is expected if a flash did not complete, but the i2c errors and EEPROM errors suggest it is possible that there is something going wrong in hardware (or firmware). It isn’t definitive.
  4. Serial console while in recovery mode, “log1.txt”, https://forums.developer.nvidia.com/uploads/short-url/6nwiziCFWeoIToUvIHBzSnxAkfx.txt, occurs while there is an attempt to flash. USB did manage to download mb2 image, and says MB1 done. Shortly after this we see this excerpt:
[0023.101] I> sdmmc DDR50 mode
[0023.106] E> SPI_FLASH: Invalid value device id: 7.

(which seems like the host PC flash software is starting to either test or fail; it is normal to test and see a failure if the test is for the presence of hardware and the hardware is not present)
  5. A successful 3p server and communications with the Jetson does show:

[0023.137] I> Found 17 partitions in SDMMC_BOOT (instance 3)
[0023.146] I> Found 42 partitions in SDMMC_USER (instance 3)
[0023.151] W> Profiler not initialized
[0023.155] I> Entering 3p server
[0023.158] I> USB configuration success

It really seems that the device is found. It seems the USB is talking to the recovery mode hardware (“Entering 3p server” makes me believe this).

  6. After the 3p server is entered nothing happens. From the Jetson side this would be true if the host side does not issue further flash commands. The host side thinks the previous success with USB is now gone and USB no longer functions so far as the host PC side can tell. The contradiction, starting with the flash log where “tegrarcm_v2 --isapplet” gets the response “USB communication failed.Check if device is in recovery” (see log_flash.txt), is the root of all failure here. This is why it sounds like a USB failure, and it could be a signal quality failure or some more permanent error, but I think this can also come from the hardware not being what the flash software is expecting. I just don’t know, because I do not know any of the black box code which we should see messages from once the “tegrarcm_v2 --isapplet” begins. I’d really like to know what “tegrarcm_v2 --isapplet” is doing, and what could cause it to make the flash software at the host side consider this a failure. One of my questions is whether the hardware being different than expected is a cause, but I just don’t know. It really could be a hardware failure under these circumstances, but in the past this has usually been caused by a USB cable or a VM, and we know it isn’t a VM, and the USB cable probably is ok (it isn’t guaranteed when it comes to signal quality).
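Since the two logs tell different stories, it may help to pull out the key lines from each side mechanically and compare them. The following is only a rough sketch: the function names are made up, the sample text is copied from the excerpts quoted in this summary, and the regular expressions assume every line follows the bracketed-timestamp layout seen in those excerpts. It says nothing about what tegrarcm_v2 does internally.

```python
import re

# Host flash log lines look like: "[   3.1236 ] tegrarcm_v2 --isapplet"
FLASH_LINE = re.compile(r"^\[\s*(\d+\.\d+)\s*\]\s*(.*)$")
# Serial console lines look like: "[0023.155] I> Entering 3p server"
SERIAL_LINE = re.compile(r"^\[(\d+\.\d+)\]\s+([IWE])>\s+(.*)$")


def first_usb_failure(flash_log):
    """Return (timestamp, previous message) for the first USB failure, or None."""
    previous = None
    for line in flash_log.splitlines():
        match = FLASH_LINE.match(line.strip())
        if match is None:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if message.startswith("USB communication failed"):
            return stamp, previous
        previous = message
    return None


def serial_milestones(serial_log):
    """Report whether the Jetson logged the 3p server and USB setup, plus E> lines."""
    result = {"entered_3p_server": False, "usb_configured": False, "errors": []}
    for line in serial_log.splitlines():
        match = SERIAL_LINE.match(line.strip())
        if match is None:
            continue
        level, message = match.group(2), match.group(3)
        if message == "Entering 3p server":
            result["entered_3p_server"] = True
        elif message == "USB configuration success":
            result["usb_configured"] = True
        if level == "E":
            result["errors"].append(message)
    return result


# Sample text taken from the excerpts quoted in this thread.
FLASH_SAMPLE = """\
[   2.1157 ] Applet version 01.00.0000
[   3.1236 ] tegrarcm_v2 --isapplet
[   3.1242 ] USB communication failed.Check if device is in recovery
[   3.3738 ] tegrarcm_v2 --ismb2
"""
SERIAL_SAMPLE = """\
[0023.101] I> sdmmc DDR50 mode
[0023.106] E> SPI_FLASH: Invalid value device id: 7.
[0023.151] W> Profiler not initialized
[0023.155] I> Entering 3p server
[0023.158] I> USB configuration success
"""

print(first_usb_failure(FLASH_SAMPLE))
print(serial_milestones(SERIAL_SAMPLE))
```

If the Jetson side reports both milestones while the host side reports a USB failure right after “tegrarcm_v2 --isapplet”, that is exactly the contradiction described above. Note the two logs use independent clocks, so the timestamps cannot be compared directly.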

The summary is very valuable and thank you very much for taking the time to do it. My replies to each point follow.

  1. Yes there is dmesg activity on host PC when the Jetson is connected in recovery mode. I guess this means that the flashing USB C connection (J512) works at least initially.

  2. The host PC flash software (either flash.sh or SDK manager) is detecting the Jetson in recovery mode initially, but once the Jetson is disconnected for the first time by the flashing software it apparently does not ever reconnect again by itself. It is from this point on that the ‘USB communication failed’ messages appear in the logs.

  3. Yes, the previous owner’s (password-protected) build is still on the eMMC.

  4. I guess I agree.

  5. Serial USB output with gtkterm on host PC does indeed show ‘entering 3p server’ when the Jetson is connected in recovery mode and when flashing is initiated (either by flash.sh or the SDK manager) but it goes no further.

  6. Yes, I think you are right. Once the initial USB connection is lost and not automatically regained, the 3p server receives no further commands and flashing fails (it makes me wonder what the host PC thinks it is flashing up to 99%). I have tried a number of cables now, including bona-fide USB 2.0 cables, and routing through USB hubs, and while it has had an effect on the percentage attained before failure it has evidently not solved the issue. The cables and host PC setup have successfully flashed other AGX Jetson Dev Kits without problem.
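To pin down the “disconnects once and never comes back” behaviour described in point 2, the host dmesg text captured during a flash attempt can be reduced to a list of connect and disconnect events per port. This is a rough sketch under assumptions: it relies on the standard kernel messages “new … USB device number N” and “USB disconnect, device number N”, the function names are made up, and the sample lines are invented for illustration (they are not from my logs).

```python
import re

# e.g. "[  100.000000] usb 1-4: new high-speed USB device number 7 using xhci_hcd"
CONNECT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+usb (\S+): new .*?USB device number (\d+)"
)
# e.g. "[  130.500000] usb 1-4: USB disconnect, device number 7"
DISCONNECT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+usb (\S+): USB disconnect, device number (\d+)"
)


def usb_events(dmesg_text):
    """Return a list of (time, port, event, device_number) in log order."""
    events = []
    for line in dmesg_text.splitlines():
        line = line.strip()
        for name, pattern in (("connect", CONNECT), ("disconnect", DISCONNECT)):
            match = pattern.match(line)
            if match:
                stamp, port, number = match.groups()
                events.append((float(stamp), port, name, int(number)))
                break
    return events


def final_state(events):
    """Map each port to its last seen event ('connect' or 'disconnect')."""
    state = {}
    for _stamp, port, name, _number in events:
        state[port] = name
    return state


# Invented sample: one connect, then a disconnect with no re-enumeration after it.
SAMPLE_DMESG = """\
[  100.000000] usb 1-4: new high-speed USB device number 7 using xhci_hcd
[  100.150000] usb 1-4: New USB device found, idVendor=0955, idProduct=7019, bcdDevice= 1.02
[  130.500000] usb 1-4: USB disconnect, device number 7
"""

events = usb_events(SAMPLE_DMESG)
print(events)
print(final_state(events))
```

Run against the real dmesg text from a failed flash, a port whose final state is “disconnect” with no later “connect” would confirm that the Jetson never re-enumerates after the flash software resets it.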

I understand that despite our (mainly yours!) best efforts we have identified a likely hardware root cause, but unfortunately nothing approaching a resolution as of yet. Oh dear :/.

Correct.

Correct. Most of the time this kind of failure is either from using a misconfigured VM or a poor quality cable. The older micro-OTG USB socket accepts either a type-A cable plug or a type-B cable plug, but has an ID pin to tell the host which type it is. This is often sold (for the older USB2 cable types) as a “charger cable” (it is a type-B at the end on the Jetson for that purpose). About 75% of all “charger cables” are incapable of sustained data transfer. However, if your flash port is using a type-C (USB3) connector, then those are very reliable and not normally an issue for signal quality failure. I do not expect a signal quality issue for the Jetson models which use USB-C for the flash port.

If the old owner’s content is visible when we use eMMC (when we simply remove the m.2 drive), then it means the boot content was probably originally for eMMC and not for the m.2. This is the most basic flash target and if we correctly flash then this previous boot content and previous Linux install would be gone and would not show up. This tends to suggest that none of your flash efforts have altered (A) the boot chain, or (B) the actual o/s partition. Recovery mode, in and of itself, does not alter a Jetson in any sustained way. Reboot after recovery mode means no change if flash did not flash something; recovery mode is a mode and not a change…only the flash operation itself will alter what is in the boot chain or the o/s partition. The indications are that no flash from your efforts has changed anything (otherwise the eMMC would have broken the original owner’s content).

The AGX Xavier uses a type-C (USB3) connector, and so the cable itself is not likely an issue. We do still see issues sometimes with particular ports on a host PC (usually a laptop), and I wouldn’t call that “rare”, but it is “close to rare”. You could try different host ports, but I don’t expect that to change things unless this is the 1% case where it matters.

I wouldn’t call it a lost cause, there are a few things which might be a blocking detail, but assuming the procedure is correct then it is more likely a hardware failure. If any flash procedure alters boot in any way (and you’ll need to watch the serial console both during boot and flash to see activity), then it means it works. If you can keep trying and do anything which breaks the existing boot chain or erases the original installation, then it means that flash can succeed. Basically, something needs to change to allow writing of the eMMC. When writing succeeds in any way it implies that full flash can succeed.