Xavier L4T 32.7.x kernel CPU NOC errors when flashing NVMe

Dear NVIDIA support,

When using the L4T l4t_initrd_flash_internal.sh script to flash NVMe on an AGX, I'm encountering kernel CPU NOC ("network on chip") errors. Our product uses an AGX SOM on a custom carrier board; the carrier board implements USB 3.0 hardware consistent with the Xavier dev kit design. What does the NOC error indicate? What corrective actions should we try?

Note that eMMC can be flashed without error.

The Xavier dev kit can be flashed successfully, to either eMMC or NVMe, which validates our flashing procedure.

= = = = = = = = =
[ 26.095344] CPU:0, Error:SCE-NOC@0xb600000,irq=495
[ 26.095346] **************************************
[ 26.095348] * For more Internal Decode Help
[ 26.095349] * http://nv/cbberr
[ 26.095351] * NVIDIA userID is required to access
[ 26.095352] **************************************
[ 26.095354] CPU:0, Error:SCE-NOC
[ 26.095356] Error Logger : 0
[ 26.095373] ErrLog0 : 0x80030000
[ 26.095377] Transaction Type : RD - Read, Incrementing
[ 26.095379] Error Code : SLV
[ 26.095381] Error Source : Target
[ 26.095383] Error Description : Target error detected by CBB slave
[ 26.095395] AXI2APB_5 bridge error: SFIFONE - Status FIFO Not Empty interrupt
[ 26.095397] AXI2APB_5 bridge error: SLV - SLVERR interrupt
[ 26.095397] Packet header Lock : 0
[ 26.095399] Packet header Len1 : 3
[ 26.095401] NOC protocol version : version >= 2.7
[ 26.095403] ErrLog1 : 0x58000
[ 26.095404] ErrLog2 : 0x0
[ 26.095407] RouteId : 0x58000
[ 26.095409] InitFlow : cbb_i/I/0
[ 26.095411] Targflow : cpu_t/T/0
[ 26.095413] TargSubRange : 0
[ 26.095414] SeqId : 0
[ 26.095416] ErrLog3 : 0x80104
[ 26.095418] ErrLog4 : 0x0
[ 26.095433] Address : 0xb480104 (unknown device)
[ 26.095434] ErrLog5 : 0x387e33
[ 26.095436] Master ID : RCE
[ 26.095438] Security Group(GRPSEC): 0x3f
[ 26.095440] Cache : 0x3 – Cacheable/Bufferable
[ 26.095442] Protection : 0x3 – Privileged, Non-Secure, Data Access
[ 26.095444] FALCONSEC : 0x0
[ 26.095446] Virtual Queuing Channel(VQC): 0x0
[ 26.095454] **************************************

= = = = = = = = = =
Console kernel log:

serial_initrd_flash_nvme_failure.txt (198.5 KB)

Hi,

Have you read your boot log yourself? When the error happened, your board didn't boot into NVMe at all.

Hi @WayneWWW. Thank you for the assistance. Yes, it is correct that the unit isn't booting from NVMe. The unit was placed in recovery mode, and this is the console output from the AGX SOM while running l4t_initrd_flash_internal.sh to flash NVMe. l4t_initrd_flash_internal.sh fails to flash the NVMe, and the unit reboots with NOC errors. We only see the NOC errors after trying to flash NVMe. I'm trying to understand how NOC errors, which are very low-level CPU errors, are being generated. Can the NOC error be decoded into something human-readable and understandable? What are CPU NOC errors indicating? Should we be concerned?

We also ran into the security enhancements in L4T 32.7.2, where extlinux.conf cannot be modified to boot alternative kernels and device trees. We will be using L4T 32.7.1 for our development to allow modifications to extlinux.conf. But that is just an informational point.

Here is the tail end of the l4t_initrd_flash_internal.sh script output. I'll upload the entire script output.

Cleaning up...
Finish generating flash package.
/media/vproc/FLIR/rcarter/flash-tools-xavier/build_output/sdk_32.7.2/Linux_for_Tegra/tools/kernel_flash/l4t_initrd_flash_internal.sh --external-only --skipuid --usb-instance 1-11 --device-instance 0 --flash-only --external-device nvme0n1 -c "/media/vproc/FLIR/rcarter/flash-tools-xavier/config/flir_nvme.xml" -S 820GiB flir-xavier nvme0n1p1
Start flashing device: 1-11, rcm instance: 0, PID: 38290
Log will be saved to Linux_for_Tegra/initrdlog/flash_1-11_0_20220614-052437.log 
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes: 38290
Ongoing processes:
Flash complete (WITH FAILURES)
make: *** [Makefile:146: flash] Error 1

A recent kernel commit appears to provide some background discussion of the NVIDIA Control Backbone (CBB) driver. I'm still at a loss to understand how this NOC error is being hit and what it might mean, as we only seem to be encountering it while trying to flash NVMe on a custom carrier.

See:
https://lkml.org/lkml/2022/5/11/1285
https://lore.kernel.org/lkml/20211221125117.6545-1-sumitg@nvidia.com/T/

Just want to remind you that your previous comment didn’t have the correct attachment.

I re-ran the NVMe flash attempt using L4T 32.7.1, to avoid the CVE security fixes in CBoot and extlinux.conf. Same result: the kernel boots with NOC errors. Here are the initrd script output, the console output and dmesg, and the L4T log.

agx_flash_nvme_l4t_32.7.1_script_log.txt (102.0 KB)
agx_flash_nvme_serial_l4t_32.7.1.txt (196.7 KB)
flash_1-11_0_20220628-083718.log (10.6 KB)

Could you clarify what is the exact scenario and steps to hit this issue?

Your comment so far sounds like: "you try to flash NVMe, it fails, so you try to boot in this situation and it crashes".

If you don’t do any initrd flash, would you hit this issue?

Here’s the script output:

l4t_initrd_fail_flash_nvme_32.7.2.txt (104.6 KB)

@WayneWWW. Other than first putting the AGX SOM into recovery mode, I’m not manually rebooting the board at all. Steps:

  • put the AGX into recovery mode
  • launch l4t_initrd_flash_internal.sh to attempt to flash NVMe. Serial output from the SOM is captured to a file; the shell script output is tee'd to a file; the L4T log file gets written automatically.
  • copy the log files from the remote Linux machine and post them to the forum
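A minimal sketch of that capture setup (the serial device path and terminal program here are my assumptions for illustration, not taken from the logs):

```shell
#!/bin/sh
# Sketch of the logging setup in the steps above. The serial device path
# and terminal program in the comments are assumptions.

# Run a command, mirroring its output to the terminal and to a log file,
# the same way the flash script output is tee'd.
run_and_log() {
    cmd="$1"
    logfile="$2"
    sh -c "$cmd" 2>&1 | tee "$logfile"
}

# Against the real flash script this might look like:
#   run_and_log "sudo ./tools/kernel_flash/l4t_initrd_flash_internal.sh ..." flash_script.log
# while the SOM serial console is captured separately, e.g.:
#   picocom -b 115200 /dev/ttyUSB0 | tee serial_console.log

# Demonstrate with a harmless command:
run_and_log "echo capture-test" demo.log
```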

The initrd boot appears to have failed with NOC errors. The console output is all from trying to flash NVMe using the initrd script. No hanky-panky going on. ;-)

Note:

  • A custom, proprietary carrier card is in use
  • This issue isn't seen when flashing eMMC on the custom carrier card
  • The exact same scripts, L4T releases, kernel, DTB, and modules work for the Xavier dev kit, eMMC or NVMe

Just want to share some points here. Long story short:

  1. So the log you just shared was dumped "in the meantime", while you ran the initrd flash and it failed?

  2. initrd flash may not work on a custom board, because initrd flash is unlike flash.sh + recovery mode. flash.sh + recovery mode is a hardware event: no matter what device tree is in use, the board will get flashed (though it is not guaranteed to boot up fine afterward).
    However, initrd flash first boots the board and then flashes the NVMe from within the initrd. Thus, the device tree matters here.

  3. The most common problem: initrd flash requires USB device mode to work, but I see your kernel log says USB device mode does not work.
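To make the two-stage dependency concrete, here is a host-side sanity-check sketch (the vendor-ID check and the fallback messages are my assumptions, not from this thread): a Jetson in recovery mode enumerates as an NVIDIA USB device, which both flash.sh and the initrd flow rely on; the initrd flow then additionally needs Jetson-side USB device mode to come up after the initrd boots, and that second step is what fails here.

```shell
#!/bin/sh
# Host-side sanity check (illustrative sketch). NVIDIA's USB vendor ID
# is 0955; a Jetson in recovery mode shows up under it in lsusb output.
# This only verifies the first stage (recovery-mode enumeration), not
# the later Jetson-side device-mode bring-up that initrd flash needs.

check_recovery_mode() {
    if command -v lsusb >/dev/null 2>&1; then
        # Look for any NVIDIA device on the host USB bus.
        lsusb | grep -i '0955' || echo "no NVIDIA USB device found"
    else
        echo "lsusb not available on this host"
    fi
}

check_recovery_mode
```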

@WayneWWW:
1/ I'm not sure what you are asking. The file "l4t_initrd_fail_flash_nvme_32.7.2.txt" is the script output for the first NOC error I posted (I couldn't seem to edit the earlier posts to attach it). The logs and dmesg output were captured during the initrd script execution.

2/ What is the correct way to flash a 1 TB NVMe drive attached to a SOM? From what I've read, initrd flash is the way to do this. Point me in the right direction. Can you recommend a good way to test the USB-C recovery port from a running kernel?

3/ In grepping through the logs I've posted (`grep -Ei usb`), I'm not seeing any USB failure messages. Is that the NOC error?

And it's not out of the question that our custom carrier has USB 3.0 issues; we are still going through hardware validation. But eMMC flash.sh runs smoothly over the USB-C recovery link, taking around 100 seconds to flash.
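For what it's worth, a filter like the sketch below casts a wider net than a plain `grep -Ei usb`; the extra patterns (xudc, extcon, vbus) are an assumption on my part about what Tegra USB device-mode messages look like, based on the kernel output in these logs:

```shell
#!/bin/sh
# Broader log filter than "grep -Ei usb": Tegra's USB device-mode
# messages mention names like "xudc", "extcon", and "vbus", which a
# usb-only grep misses. The pattern list is an assumption, not exhaustive.

scan_usb_device_mode() {
    # Case-insensitive search for USB device-mode related messages.
    grep -Ei 'usb|xudc|extcon|vbus' "$1"
}

# Tiny sample resembling the kernel output quoted in this thread:
cat > sample_dmesg.txt <<'EOF'
[   26.095344] CPU:0, Error:SCE-NOC@0xb600000,irq=495
[   26.357948] Could not get extcon-dev /xudc@3550000:vbus(0)
EOF

scan_usb_device_mode sample_dmesg.txt
```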

Hi,

  1. I am just asking whether you are dumping the UART log while the initrd is trying to flash. It sounds like your answer is yes.

  2. There is no way to resolve this issue until you get the USB port issue fixed first. Also, USB-C recovery mode has nothing to do with your issue now.
    If you want a "right direction", share your board schematic with me and describe which USB ports are in use on your board.

  3. [ 26.357948] Could not get extcon-dev /xudc@3550000:vbus(0)

  4. I feel you don't understand what I am talking about, so let me say this again: initrd flash is not 100% guaranteed to work on a custom board, and that is what has happened on your board.

If you just want to boot from the NVMe drive, then you can follow the method you posted.

https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3261/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/flashing.html#wwpID0E0QN0HA

But if you want to flash it with commands from a host machine, that is not going to work on your board in the current situation.

@WayneWWW Again, thanks for the assistance. You really know your way around this target. Pretty impressive!

1/ Yes, the UART (serial console) output from the SOM is being dumped while the initrd script is running.

2/ Let me talk with our EE’s on the hardware design team.

3/ Ahhh! Thank you. Let me dig through the device tree.
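One way to start that dig (a sketch; the DTB filename is an assumption, and `dtc` must be installed on the host) is to decompile the flashed DTB and look for the USB device-mode controller node named in the error message:

```shell
#!/bin/sh
# Sketch for inspecting the device tree for the node named in
# "Could not get extcon-dev /xudc@3550000:vbus(0)". The DTB filename
# below is an assumption; use the one your flash config references.
#
# Decompile the binary DTB back to source:
#   dtc -I dtb -O dts -o tegra194.dts tegra194-p2888-0001-p2822-0000.dtb
# then find the USB device-mode controller and its extcon wiring:
#   grep -n -A6 'xudc@3550000' tegra194.dts

# For illustration, run the same grep over a hand-written snippet:
cat > snippet.dts <<'EOF'
xudc@3550000 {
        compatible = "nvidia,tegra194-xudc";
        status = "okay";
};
EOF

grep -n 'xudc@3550000' snippet.dts
```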

4/ I’m moving closer to understanding this. One step forward at a time.

@WayneWWW, can we go back to the original question for just a minute? What do the NOC errors indicate? The kernel is receiving and alerting on AXI-to-APB bus errors. Is this a side effect of the USB-C device not registering successfully?

Hi,

It should be a side effect of the initrd flash failing on your custom board. As you can see, your log has some RTCPU errors around the NOC print.

I don't suggest you keep digging into the NOC error at this moment. You should first provide an environment that can support initrd flash.

Please refer to the adaptation guide document; there is a USB porting section on configuring the device tree to match your hardware design. From your current error log, I can tell you are still using the default device tree, which may not work for 90% of AGX custom boards.

@WayneWWW Understand. Thank you.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.