Jetson Nano production module does not boot on custom carrier board, but does on Auvidea's

Hi,

I am facing an issue: after flashing, some of our Jetson Nano production modules do not boot on the custom carrier boards we use for flashing the Nanos. (Please note that these boards have no HDMI connection and no means of checking the serial logs during boot.)

Interestingly, the same flashed unit boots on another carrier board, made by a company called Auvidea.

This problem does not occur consistently, only on some of the units that we have, and I use the same image to flash all the units.

Any leads on how to debug this? I suspect something might have gone wrong with system image creation that happens after mounting the Nano to be flashed as ntfs (a guess, not sure about this). I use the following command to flash:

./flash.sh jetson-nano-emmc mmcblk0p1

Is there a way I could completely force-erase the eMMC on the Nano? Or is there a chance the units are bricked?

I even tried cleaning up the existing system images and boot images, and allowed the flash script to generate new ones (by not passing the -r option to flash script), still the same problem.

Hi jetson_user,

It is hard to tell what is going on if there is no means to check the serial logs.

To debug a flash problem, we generally need logs from both the host and the device.

That is, we need the log from flash.sh, which is shown on your host.
We also need the serial log from the device, which can be captured with minicom on the host. This part is important because the flash script normally returns only a generic message. We cannot get a detailed log just by reading the error from flash.sh.

You could at least share the error from flash.sh first. If we cannot tell what is going on, then you may need to find a way to dump the serial log.
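
As a rough illustration of collecting both logs (a sketch, not something posted in this thread): the helper name `run_and_log`, the log file names, and the serial device `/dev/ttyUSB0` are all assumptions, so adjust them to your host. The flash command is the one quoted in the first post.

```shell
#!/bin/sh
# Hypothetical helper: run any command, show its output live, and keep a
# copy of stdout and stderr in a log file.
# Note: the return status is that of tee, not of the command itself.
run_and_log() {
    log_file="$1"
    shift
    "$@" 2>&1 | tee "$log_file"
}

# Host side (run from the directory that contains flash.sh):
#   run_and_log flash.log ./flash.sh jetson-nano-emmc mmcblk0p1
#
# Device side, in a second terminal, started before powering the board.
# /dev/ttyUSB0 is an assumed device name for a USB-to-serial adapter:
#   minicom -D /dev/ttyUSB0 -b 115200 -C serial.log
```

Both files can then be attached to a post, which gives the host-side and device-side views of the same flash attempt.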

Hi WayneWWW,

Thank you for your response.

The flash script does not report any error. It flashes 100% and also attempts a cold boot at the end. It’s just that the Nano does not boot. So is there a way to make sure the eMMC is completely erased?

Hi jetson_user,

If no log is available, you must have some other indication that this device “does not boot”. May I ask how you know it does not boot? Is the power indicator LED off, or is there no output on the monitor?

For your case, I guess an alternative is to put the flashed module on any carrier board that can dump the serial log or boot into the system.

If it works on another carrier board and you have verified that this is indeed the new rootfs, then I think it was flashed successfully. In that case, you may need to review the hardware design of your carrier board.

Hi WayneWWW,

Thank you for your response.

By “the device does not boot”, I mean that I do see the LED indicating that the device is powered up, but it does not show up on the network. There’s no HDMI, so I would not know if it is stuck somewhere. The Nano has been assigned a static IP by installing the network interface configuration in the rootfs before flashing. The Nano is also connected directly to the host, so no router is involved. And I can log in to the same Nano when it boots from an Auvidea board, using the same known IP address.

Which logs would be helpful in this case? I went through the syslog after the Nano boots from an Auvidea board and I am not sure if I found anything convincing. Besides I assume these are the logs from the success case.

Is there a way I can erase the flash completely so that these modules become reusable, like units shipped from the factory? Or is it a characteristic of some units that they cannot boot on a certain set of carrier boards (though this seems highly unlikely)? Also, does the host machine play any role here? We use an Intel NUC with an i3 processor as the host. Strangely, we do not have the issue on standard PCs with i5 processors, but we have not run the flashing on those often enough to say anything for sure.

  1. A full flash is sufficient and is equivalent to a total erase.

  2. You could try the process below to verify your flash.

Flash the module and see if it works on your custom carrier board. If it does not, unplug the module, put it directly on the Auvidea carrier board and boot it. If it boots successfully, the flash is fine.

There should be no case where the flash process reports success but nothing was actually flashed.

If the flash succeeds but you see no monitor output or IP address on your board, it is possible that

  • the system hangs somewhere during boot
    -> We cannot help unless we get the serial console log.
  • some kernel modules are not loaded
    -> Check your "uname -r" output against the directory names under /lib/modules/. The two should match, or some drivers may not be loaded correctly.
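
The kernel-module check in the last bullet can be scripted. A minimal sketch, meant to be run on the booted module; the function name `modules_match` is made up for this example, and `/lib/modules` is simply the default location mentioned in the bullet.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a module directory named exactly
# like the given kernel release exists.
#   $1 = kernel release string, e.g. the output of `uname -r`
#   $2 = modules root (optional, defaults to /lib/modules)
modules_match() {
    release="$1"
    modules_root="${2:-/lib/modules}"
    [ -n "$release" ] && [ -d "${modules_root}/${release}" ]
}

# Usage on the device:
#   if modules_match "$(uname -r)"; then
#       echo "modules directory matches $(uname -r)"
#   else
#       echo "no /lib/modules/$(uname -r): some drivers may not load"
#   fi
```

The optional second argument also lets you point the check at a mounted rootfs on the host instead of the live system.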

    Hi WayneWWW,

    I am still trying, without any luck so far, to obtain the serial logs at startup. Also, in the “no boot” case, nothing is written to the syslog when the module is powered on the board where it fails to boot; I checked this afterwards by booting the module on a working board.

    When I flash the standard JetPack using the SDK Manager, it boots fine, even on our custom carrier boards.

    When I flash using the flash script, the problem I mentioned happens, but only on a few units. There seems to be a hardware dependency, as if some Nano modules are not suited to being flashed by the flash script when our custom carrier board is used, though I have not been able to crack this relation yet.

    I also verified that the output of “uname -r” (4.9.140) matches with the directory listed under /lib/modules (4.9.140/ and 4.9.140-tegra/).

    Meanwhile I am trying at my end to get as much more information as possible.

    Hi WayneWWW,

    I still do not have a mechanism to retrieve the serial logs and I understand it is difficult to help out without that.

    I still have a question: Is there a difference in the way the SDK manager flashes the modules and the flash.sh script does (./flash.sh jetson-nano-emmc mmcblk0p1) ?

    Hi jetson_user,

    Sorry for the late reply.

    I still have a question: Is there a difference in the way the SDK manager flashes the modules and the flash.sh script does (./flash.sh jetson-nano-emmc mmcblk0p1) ?

    Basically, they are the same. I am not sure whether SDK Manager downloads the package every time or not. If it does not, then they are exactly the same.

    Hi, this looks like a PCB quality issue on some boards. Did you check and compare component placement and soldering quality between a failing board and a good board?

    Hi,

    Thank you for the responses.

    Hi Trumany,

    I do not know exactly how to check these, but the same carrier boards with the issue can boot other (fresh) Nanos that are flashed with a clean image (i.e., not the image which resulted in a boot issue).

    Hi jetson_user,

    I think you had better clarify your case again, since I cannot tell which exact scenario hits the error.

    We would like to know whether it is a hardware design problem on the carrier board or a software driver issue.

    Please tell us

    From #11

    I do not know exactly how to check these, but the same carrier boards with the issue can boot other (fresh) Nanos that are flashed with a clean image

    Wayne: Do you mean this issue only happens to your customized image? It does not matter about the carrier board?

    From #7

    When I flash using the flash script, the problem I mentioned happens, but only on a few units. There seems to be a hardware dependency, as if some Nano modules are not suited to being flashed by the flash script when our custom carrier board is used, though I have not been able to crack this relation yet.

    Wayne: It seems issue is related to carrier board again. Which one is true here?

    Hi WayneWWW,

    I understand the issue is confusing and even we are baffled. We do not know at this point for certain whether it is a hardware issue or a software one, or a combination.

    Wayne: Do you mean this issue only happens to your customized image? It does not matter about the carrier board?

    Yes, the issue happens only with our customized image, and when our carrier board is used to flash the Nano. If a Nano flashed this way does not boot on our carrier board, the same Nano does boot when moved to Auvidea’s carrier board, but we do not intend to use Auvidea carrier boards in the field, at least for now.
    When I say ‘clean’ image, I mean that I remove the system.img and system.img.raw files and run the flash script so that it generates a ‘new/clean’ image. But each time, it is with our customized kernel and device tree.

    Wayne: It seems issue is related to carrier board again. Which one is true here?

    As discussed above, we do not know for certain if it is the carrier board issue. It could be that something goes wrong with the electronics while flashing, rendering the Nanos non-bootable (and also I was wondering if you could help with indicating if I could debug something into the BSP). The main problem is that if I encounter such a situation when a Nano does not boot on our carrier boards (after flashing on our carrier boards) it is not recoverable, which means no matter how many times I flash it and with a ‘clean’ image as discussed above, or even re-compiling the entire kernel+device tree before, the Nano does not boot on our carrier boards but does so on Auvidea’s.

    This is the “relation” I was referring to in #12: somehow only some Nanos are not suitable for our carrier boards (we always use the same image), and cannot be made so.

    There are multiple such units till now. So what we observe is if we flash a unit on our carrier boards which somehow makes it non-bootable on our carrier board, then the Nano is non-reclaimable as I explained, and also we have to allow the flash script to generate a new system.img* (by not passing the ‘-r’ option) and deleting the old system.img*, when we move on to flashing the fresh Nano unit. Otherwise we create another non-reclaimable unit.

    So, our focus at the moment is to somehow reclaim those units, and prevent such an occurrence in the future. What we now do is create a new system.img* each time for flashing a Nano, and deleting the old image once it is done. This process is slow, but somehow safe till now. We have to flash a large number of Nanos and possibly in an automated way in the future (but right now this isn’t the problem I intend to put forth). Basically, we somehow have to prevent producing a non-bootable Nano because there are high chances that if we hit one such unit, we create other non-reclaimable units.

    Let me know if this information helps further.
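
    The regenerate-every-time workflow described in this post could be scripted roughly as follows. This is a sketch only: the helper name `remove_stale_images` is invented here, and the assumption that the generated system.img and system.img.raw live in a `bootloader/` directory next to flash.sh should be checked against your own tree.

```shell
#!/bin/sh
# Hypothetical helper: delete previously generated system images from the
# given directory so that the next flash (run without -r) rebuilds them.
remove_stale_images() {
    image_dir="$1"
    rm -f "${image_dir}/system.img" "${image_dir}/system.img.raw"
}

# Usage, from the directory that contains flash.sh (bootloader/ is an
# assumed location for the generated images):
#   remove_stale_images bootloader
#   ./flash.sh jetson-nano-emmc mmcblk0p1
```

    Running the two steps together for every unit makes it harder to reuse an old image by accident, at the cost of the slower image generation already mentioned.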

    Hi jetson_user,

    I got it. So a clean image means a new system image created without the “-r” parameter in flash.sh.

    somehow only some Nanos are not suitable for our carrier boards (we always use the same image)
    Do you mean there are some Nanos that can be flashed and boot up successfully on your carrier board?

    To be honest, if this comment is true, I don’t think this issue can be resolved in software, especially without a UART console log.

    Do you mean there are some Nanos that can be flashed and boot up successfully on your carrier board?

    All Nanos can be flashed on our carrier boards. Some Nanos (a significant number, about 6 in 50) just don’t boot after flashing on our carrier boards, but most of them do. We need some way to reclaim the non-bootable ones, as they boot on Auvidea boards but not on ours. So the idea is to look into what’s going wrong with flashing, ensure that the image we create boots successfully every time on our carrier boards, and also check whether hardware could be the issue.

    And we use our same custom image each time.

    Yes, I understand that without console logs it is difficult. Meanwhile we are looking into obtaining those as well.

    So the idea is to look into what’s going wrong with flashing and ensuring creation of an image that results in successful boot each time on our carrier boards, and also checking if hardware can be an issue.

    I don’t think checking the flash process would help here. The flash process only reads the EEPROM value from each module, and if the EEPROM value is not valid, the flash process fails. It would not affect the system image.

    Hi WayneWWW,

    I think we are facing a similar problem. We have a custom board as well. It works fine with the development module but does not boot with the production module. The production module we used does boot on a carrier board from Auvidea.

    Please take a look at the UART output:

    [0000.321] [L4T TegraBoot] (version 00.00.2018.01-l4t-80a468da)
    [0000.327] Processing in cold boot mode Bootloader 2
    [0000.331] A02 Bootrom Patch rev = 1023
    [0000.335] Power-up reason: pmc por
    [0000.338] No Battery Present
    [0000.341] pmic max77620 reset reason
    [0000.344] pmic max77620 NVERC : 0x40
    [0000.347] RamCode = 0
    [0000.350] Platform has DDR4 type RAM
    [0000.353] max77620 disabling SD1 Remote Sense
    [0000.357] Setting DDR voltage to 1125mv
    [0000.361] Serial Number of Pmic Max77663: 0xa12ca
    [0000.369] Entering ramdump check
    [0000.372] Get RamDumpCarveOut = 0x0
    [0000.375] RamDumpCarveOut=0x0,  RamDumperFlag=0xe59ff3f8
    [0000.380] Last reboot was clean, booting normally!
    [0000.385] Sdram initialization is successful 
    [0000.389] SecureOs Carveout Base=0x00000000ff800000 Size=0x00800000
    [0000.395] Lp0 Carveout Base=0x00000000ff780000 Size=0x00001000
    [0000.401] BpmpFw Carveout Base=0x00000000ff700000 Size=0x00080000
    [0000.407] GSC1 Carveout Base=0x00000000ff600000 Size=0x00100000
    [0000.413] GSC2 Carveout Base=0x00000000ff500000 Size=0x00100000
    [0000.418] GSC4 Carveout Base=0x00000000ff400000 Size=0x00100000
    [0000.424] GSC5 Carveout Base=0x00000000ff300000 Size=0x00100000
    [0000.430] GSC3 Carveout Base=0x000000017f300000 Size=0x00d00000
    [0000.446] RamDump Carveout Base=0x00000000ff280000 Size=0x00080000
    [0000.452] Platform-DebugCarveout: 0
    [0000.456] Nck Carveout Base=0x00000000ff080000 Size=0x00200000
    [0000.461] Non secure mode, and RB not enabled.
    [0000.478] Csd NumOfBlocks=0
    

    I think the problem is the last line, because no blocks were found:
    [0000.478] Csd NumOfBlocks=0

    UART output of the development module, which boots normally:

    [0000.125] [L4T TegraBoot] (version 00.00.2018.01-l4t-80a468da)
    [0000.131] Processing in cold boot mode Bootloader 2
    [0000.135] A02 Bootrom Patch rev = 1023
    [0000.139] Power-up reason: pmc por
    [0000.142] No Battery Present
    [0000.145] pmic max77620 reset reason
    [0000.148] pmic max77620 NVERC : 0x40
    [0000.151] RamCode = 0
    [0000.154] Platform has DDR4 type RAM
    [0000.157] max77620 disabling SD1 Remote Sense
    [0000.161] Setting DDR voltage to 1125mv
    [0000.165] Serial Number of Pmic Max77663: 0x2b31ed
    [0000.173] Entering ramdump check
    [0000.176] Get RamDumpCarveOut = 0x0
    [0000.179] RamDumpCarveOut=0x0,  RamDumperFlag=0xe59ff3f8
    [0000.184] Last reboot was clean, booting normally!
    [0000.189] Sdram initialization is successful 
    [0000.193] SecureOs Carveout Base=0x00000000ff800000 Size=0x00800000
    [0000.199] Lp0 Carveout Base=0x00000000ff780000 Size=0x00001000
    [0000.205] BpmpFw Carveout Base=0x00000000ff700000 Size=0x00080000
    [0000.211] GSC1 Carveout Base=0x00000000ff600000 Size=0x00100000
    [0000.216] GSC2 Carveout Base=0x00000000ff500000 Size=0x00100000
    [0000.222] GSC4 Carveout Base=0x00000000ff400000 Size=0x00100000
    [0000.228] GSC5 Carveout Base=0x00000000ff300000 Size=0x00100000
    [0000.234] GSC3 Carveout Base=0x000000017f300000 Size=0x00d00000
    [0000.250] RamDump Carveout Base=0x00000000ff280000 Size=0x00080000
    [0000.256] Platform-DebugCarveout: 0
    [0000.259] Nck Carveout Base=0x00000000ff080000 Size=0x00200000
    [0000.265] Non secure mode, and RB not enabled.
    [0000.270] Read GPT from (4:0)
    [0000.398] Csd NumOfBlocks=62333952
    [0000.403] Set High speed to 1
    

    Do you have any idea why this happens?
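
    When two captures look this similar, it can help to strip the timestamps and diff what is left, so the first point of divergence stands out. A minimal sketch, assuming each capture has been saved to its own file (`good.log` and `bad.log` are placeholder names, and both function names are invented for this example):

```shell
#!/bin/sh
# Print a TegraBoot UART log with the leading "[0000.123] " timestamp
# (and any leading indentation) removed from every line.
strip_timestamps() {
    sed -E 's/^[[:space:]]*\[[0-9]+\.[0-9]+\][[:space:]]*//' "$1"
}

# Diff two logs after stripping timestamps.
#   $1 = log from the module that boots, $2 = log from the one that hangs
# Returns diff's status: 0 if the stripped logs are identical, 1 if not.
compare_boot_logs() {
    good_tmp="$(mktemp)"
    bad_tmp="$(mktemp)"
    strip_timestamps "$1" > "$good_tmp"
    strip_timestamps "$2" > "$bad_tmp"
    status=0
    diff "$good_tmp" "$bad_tmp" || status=$?
    rm -f "$good_tmp" "$bad_tmp"
    return "$status"
}

# Usage:
#   compare_boot_logs good.log bad.log
```

    Lines that differ only in their timestamp drop out of the diff; lines whose content differs, or that appear in only one capture, remain.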

    Could you file a new topic or describe your setup?

    And could you share what kind of software is installed on the module?

    Hi WayneWWW,

    I was also working on the same thing, but have been unable to fetch UART logs so far. It would be helpful to follow any related posts by members who do have UART logs, for debugging our issue. Could I get a link to the new post here, if reifenrath.michel puts one up?

    Hi, here is the Link to the new topic: