Board detection issue with JetPack 5.1

I’ve just had a call with NVIDIA and this topic came up:
https://forums.developer.nvidia.com/t/flashing-nano-devkit-nvme-over-usb-fails

Since the issue in that topic turned out to be something else entirely, which is not yet tracked in the bug tracker, I was told to create a new topic for the actual issue:

Steps to reproduce:

  • Do not connect any board to the host
  • Run l4t_initrd_flash.sh with the --no-flash option
  • Run l4t_initrd_flash.sh with the --flash-only option

Observed:
The command succeeds, using the SKU “0000” and defaulting to the device tree of the Orin config.
The device tree for SKU 0000 is then flashed to any Orin board.

Expected:
The command should fail in the “--no-flash” phase because the BoardID and SKU are not set and could not be obtained from the device.

The script no longer throws an error in this case, which leads to boards being flashed with incorrect device trees. The only way to notice that this has happened is to check whether the log prints an empty sku().
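
Until the script fails on its own, a saved flash log can be checked for the empty board-info line. The following is only a minimal sketch: the function name board_info_is_populated is hypothetical and not part of the BSP, and it assumes the log contains a line starting with "Board ID(" in the same form as the output quoted further down in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the L4T BSP.
# Usage: board_info_is_populated <logfile>
# Returns 0 if a board-info line with a non-empty sku() is found,
#         1 if the board-info line has an empty sku(),
#         2 if no board-info line is present at all.
board_info_is_populated() {
	local logfile="$1"
	# In a basic regular expression, "(" and ")" match literally.
	if ! grep -q '^Board ID(' "${logfile}"; then
		return 2
	fi
	if grep -q '^Board ID(.*sku()' "${logfile}"; then
		return 1
	fi
	return 0
}
```

It could be called as `board_info_is_populated my_flash.log || echo "board info missing"`, where my_flash.log is whatever file the flash output was redirected to.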

So would it help if we added a message such as "it is using offline mode"?

Currently, in offline mode, the module info in the flash log should be empty.

I think the system did exit with an error in the past, but I might be wrong.
Sorry that the post has become rather long, but I feel it needs some explanation.

Possible cases:

  • Board connected
  • Board NOT connected
  • SKU/BoardID set
  • SKU/BoardID NOT set

Possible combinations:

  • Board connected + SKU/BoardID set → Use the overwritten SKU/BoardID, ignore the connected board
  • Board NOT connected + SKU/BoardID set → Use the overwritten SKU/BoardID
  • Board connected + SKU/BoardID NOT set → Read the BoardID/SKU from the connected board
  • [This does not behave as expected] Board NOT connected + SKU/BoardID NOT set → Error, do not continue with defaults
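
The four combinations above can be written down as a small decision function. This is a sketch of the expected behaviour only, not code from flash.sh: the function name resolve_board_source and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the expected decision table; the function and
# argument names are illustrative and do not exist in flash.sh.
# Args: $1 = "yes" if a board is connected in recovery mode, else "no"
#       $2 = user-supplied BOARDID (may be empty)
#       $3 = user-supplied BOARDSKU (may be empty)
resolve_board_source() {
	local connected="$1" boardid="$2" boardsku="$3"
	# An explicit override wins, whether or not a board is connected.
	if [ -n "${boardid}" ] && [ -n "${boardsku}" ]; then
		echo "override"
		return 0
	fi
	# No override: the values must come from the connected board.
	if [ "${connected}" = "yes" ]; then
		echo "probe"
		return 0
	fi
	# Neither source is available: fail instead of using defaults.
	echo "Error: no board connected and BOARDID/BOARDSKU not set." >&2
	return 1
}
```

Only the last branch differs from what the script currently does: it returns a non-zero status instead of continuing with default values.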

More details for the developers:

The flash.sh script, which is called by l4t_initrd_flash.sh, calls the function “update_flash_args”, which is defined in the board config files. That helper function is responsible for parsing the BOARDID and the SKU to set the correct DTB path. It relies on this info being correct; otherwise it falls back to default values. In the case above, the BOARDID and SKU are unset when this function is called!

If the SKU is not set, the function “get_board_version” should be executed and eventually reach this line:

chkerr "Reading board information failed.";

The function “get_fuse_level ()” is used to get the hwchipid, which is later used in the pre-processing of “get_board_version()”.
When the board is not connected, the hwchipid will not be set.

The following code in flash.sh is responsible for getting BOARDID/FAB/BOARDSKU if FAB is not set by the user.
As the hwchipid has not been set because no board is connected, the whole block is skipped and the (empty) content is just printed.

If there is no FAB and no hwchipid, there should be an error in my opinion.

# get the board version and update the data accordingly
if declare -F -f process_board_version > /dev/null 2>&1; then
	board_FAB="${FAB}";
	board_id="${BOARDID}";
	board_sku="${BOARDSKU}";
	board_revision="${BOARDREV}"
	if [ "${board_FAB}" == "" ]; then
		if [ "${hwchipid}" != "" ]; then
			get_board_version board_id board_FAB board_sku board_revision emcfuse_bin;
			_nvbrd_trk=1;
			BOARDID="${board_id}";
			BOARDSKU="${board_sku}";
			FAB="${board_FAB}";
			BOARDREV="${board_revision}";
		fi;
	fi;
	process_board_version "${board_id}" "${board_FAB}" "${board_sku}" "${board_revision}" "${hwchiprev}";
fi;

Possible solution, at least in this case:

  • Still execute get_board_version even if the hwchipid is not set.
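
An alternative would be an explicit guard before the board version is processed. The sketch below is an assumption about how such a check could look, not a patch against the real script: require_board_identity is a hypothetical name, and it only reuses the variable names FAB and hwchipid from the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical guard; require_board_identity is not part of flash.sh.
# It reads the same variable names the excerpt above uses (FAB, hwchipid).
require_board_identity() {
	# Fail if the user gave no FAB and no board was probed.
	if [ -z "${FAB:-}" ] && [ -z "${hwchipid:-}" ]; then
		echo "Error: FAB is not set and no board was detected (hwchipid is empty)." >&2
		return 1
	fi
	return 0
}
```

Called as `require_board_identity || exit 1` before process_board_version, it would stop the run in exactly the case where neither the user nor the board supplied the information.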

The script has many ways to detect this error; they are just not used.

Comparison of simple flash.sh and l4t_initrd_flash.sh, both with no board connected:

flash.sh → correct

sudo ./flash.sh jetson-xavier-nx-devkit mmcblk0p1
###############################################################################
# L4T BSP Information:
# R35 , REVISION: 3.1
# User release: 0.0
###############################################################################
Getting fuse level
RCM command was ./tegrarcm_v2   --uid
ECID is
Error: probing the target board failed.
       Make sure the target board is connected through
       USB port and is in recovery mode.

l4t_initrd_flash.sh → incorrect (added some debug logs in the script)

sudo ./tools/kernel_flash/l4t_initrd_flash.sh jetson-xavier-nx-devkit mmcblk0p1
/mnt/wsl/data/projects/oms5/Linux_for_Tegra/tools/kernel_flash/l4t_initrd_flash_internal.sh --no-flash jetson-xavier-nx-devkit mmcblk0p1
************************************
*                                  *
*  Step 1: Generate flash packages *
*                                  *
************************************
Create folder to store images to flash
Generate image for internal storage devices
Generate images to be flashed
ADDITIONAL_DTB_OVERLAY=""  /mnt/wsl/data/projects/oms5/Linux_for_Tegra/flash.sh --no-flash --sign  jetson-xavier-nx-devkit mmcblk0p1

###############################################################################
# L4T BSP Information:
# R35 , REVISION: 3.1
# User release: 0.0
###############################################################################
Getting fuse level
RCM command was ./tegrarcm_v2   --uid
ECID is
HWchipid is set to
Board ID() version() sku() revision()

Here the flash process continues instead of exiting with an error.
When running flash.sh directly, this does not happen because the following check is executed:

# SoC Sanity Check
if [ ${no_flash} -eq 0 ]; then
	chk_soc_sanity;
fi;

@kayccc An Nvidia representative has told me that my solved threads are not considered for bugfixes anymore. I guess the thread should not be marked as solved?