Device Tree customization for AGX

My organization is building a custom carrier board for the AGX Xavier. Therefore, to boot Linux, we need a custom kernel. I’ve read extensive NVIDIA documentation and quite a few blog posts, but still have not gotten a custom kernel that boots on our carrier board.

My first step is simply to build a “vanilla” kernel, without any customizations. I’m following the instructions here:

https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-3261/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/kernel_custom.html#

All of our customizations are based off release 32.6.1. I’ve succeeded at everything on the above page up to the “sign and encrypt kernel, kernel-dtb, and initrd binary files” step (which I have not yet done). So, my questions…

Q1: is it absolutely necessary to sign and encrypt everything? What are the consequences of not signing?

Q2: After completing these instructions, how do I flash the target AGX system? Simply boot it into force-recovery mode, connect a USB cable to the OTG port, and execute

sudo ./flash.sh jetson-xavier-agx mmcblk0p1

from the Linux_for_Tegra directory on my Ubuntu host PC where I’ve compiled the kernel?

Q3: We have created pinmux files for our custom carrier board, using the Pinmux Excel file here:
https://developer.nvidia.com/embedded/dlc/jetson-agx-series-devkit-pinmux-configuration-template

It’s not clear what I should do with the output .dtsi files afterwards. The blog post below mentions a python script that generates .cfg files, but it’s not clear what I should do with these .cfg’s after I’ve generated them, either:

My missing piece of documentation is the bit that describes how to integrate a custom pinmux into a standard L4T kernel build. Can someone point me in the right direction?

I won’t give a complete answer, but realize that all partitions, other than the rootfs/APP, are signed. If there is no encryption key used, e.g., when fuses are not burned, then the key is just a NULL, but the partitions are still signed. A valid signature (even if it is signed by NULL) is needed to use such a partition.

The command line you gave for flash is basically correct, but it is intended for the dev kit. The part which differs (mostly) is that you will need a different device tree. I strongly suggest you use this command to flash a dev kit, and keep a log. This will show you where files and images come from and will save a lot of effort when replacing this with your device tree and kernel. Example to log when flashing:
sudo ./flash.sh jetson-xavier-agx mmcblk0p1 2>&1 | tee log_flash.txt
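
Once the log exists, something like the following could pull out the file names worth looking at. This is only a sketch: the function name list_flash_artifacts is made up for illustration, and it assumes nothing about the log beyond it being plain text in which file names appear as whitespace-separated tokens.

```shell
#!/bin/sh
# Sketch: list the distinct .cfg/.dtb/.conf file names mentioned in a flash log.
# Assumption: the log is plain text; list_flash_artifacts is a hypothetical helper.
list_flash_artifacts() {
    # -o prints only the matching token; sort -u removes duplicates
    grep -oE '[^[:space:]"=]+\.(cfg|dtb|conf)' "$1" | sort -u
}

# Only runs if a log from the tee command above is present in this directory
if [ -f log_flash.txt ]; then
    list_flash_artifacts log_flash.txt
fi
```

The resulting list is a starting point for deciding which pinmux, pad voltage and device tree files a custom board would need to replace.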

Most of the rootfs image is created from “Linux_for_Tegra/rootfs/”, but depending on arguments passed to the flash script, you will find some content in “Linux_for_Tegra/rootfs/boot” is copied in just before creating the final rootfs image.

Also, kernel and device tree content in the rootfs can be used instead of partitions unless security fuses are burned.

[Note: I am welshguy’s consultant, taking over from welshguy on the question above.] I’ve made some progress but of course have more questions.

  1. I ran your flash command and generated a log file as you suggested. Note: for me, the command I needed to use was:

sudo ./flash.sh jetson-xavier mmcblk0p1 2>&1 | tee log_flash.txt

Your board ID argument (jetson-xavier-agx) gives me “Error: Invalid target board - jetson-xavier-agx”. (Perhaps because I am doing all of this on a 32GB non-industrial AGX attached to an Auvidea X220 carrier board, instead of the official NVIDIA dev kit?) Regardless, I am able to flash my AGX successfully, and generate the log, using the above command.

  2. From the log, it appears that these are the pinmux and padvoltage files that are actually being copied to the AGX:

Linux_for_Tegra/bootloader/t186ref/BCT/tegra19x-mb1-pinmux-p2888-0000-a04-p2822-0000-b01.cfg
Linux_for_Tegra/bootloader/t186ref/BCT/tegra19x-mb1-padvoltage-p2888-0000-a00-p2822-0000-a00.cfg

For our homebuilt carrier board, should I simply replace these files with files of the same names, generated by pinmux-dts2cfg.py from our own .dtsi files? (And then re-flash the AGX with our own .cfg files?)

  3. How would I know if the security fuses have been burned or not? Can you point me at a command I can run on the AGX, or something to look for in the flash log?

I’m the wrong guy to answer this, but replacing those .cfg files would probably work if they are correctly edited. It might be better (I have not tried this myself) to edit the main target config file to change the target specs. To explain: when you flash to a target, e.g., “jetson-xavier”, this really refers to the “jetson-xavier.conf” file, which itself might be a symbolic link, e.g., to “p2822-0000-p2888-0004.conf”. That latter file name is related to both module and carrier board, and you could create something else, name a conf file for it, and use that as your flash target. This file is human readable, and basically specifies PINMUX (which is device tree lane routing), and perhaps a second config. Note that much will be “in common” for your board and others since the module itself is still in common.

I have not actually done that myself, so I can’t give more advice on it, but this is more or less the same way to do what you suggested, to change the device tree by replacing what the system was using (except that it would be the “proper” way without hacks).
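
A rough sketch of that idea, under stated assumptions: clone_board_conf is a hypothetical helper, “my-carrier.conf” is a made-up target name, and PINMUX_CONFIG and PMC_CONFIG are the variable names reported elsewhere in this thread. Whether those variables are set in the target conf itself or in a file it sources may differ between releases, so the helper also prints any source lines to follow.

```shell
#!/bin/sh
# Sketch: clone a flash target .conf under a new board name.
# Assumptions: run from Linux_for_Tegra; clone_board_conf and the
# "my-carrier.conf" name used below are hypothetical.
clone_board_conf() {
    src=$1   # existing target, e.g. jetson-xavier.conf (may be a symlink)
    dst=$2   # new target file to create
    # Resolve the symlink so the copy is a regular, editable file
    real=$(readlink -f "$src") || return 1
    cp "$real" "$dst" || return 1
    # Show the lines a custom carrier would likely need to edit,
    # plus any "source" lines that pull in a shared config file
    grep -nE 'PINMUX_CONFIG|PMC_CONFIG|^[[:space:]]*source[[:space:]]' "$dst" \
        || echo "nothing to edit found in $dst"
}

# Usage (not run automatically):
#   clone_board_conf jetson-xavier.conf my-carrier.conf
#   sudo ./flash.sh my-carrier mmcblk0p1 2>&1 | tee log_flash.txt
```

Copying rather than editing the stock file keeps the dev kit target intact, so the same Linux_for_Tegra tree can still flash an unmodified reference setup for comparison.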

Thanks linuxdev, that actually works. I can create my own top-level jetson-xavier-tim.conf file which references my custom PINMUX_CONFIG and PMC_CONFIG files. Running flash.sh with the <board_id> argument set to jetson-xavier-tim does actually pick up my customized pinmux and padvoltage .cfg files. Cool.

What is the answer to my question about how to determine whether or not the security fuses have been burned? I see a lot of extensive documentation on this topic, but can’t find an answer to that specific question.

I have not used a board with burned fuses, so I can’t say for sure. Someone from NVIDIA would be able to answer.

What I can tell you is that content which exists in a partition which does not match the signature will be rejected, and that the same is true regardless of whether the fuse is burned or not, but the non-fused version uses a NULL key by default. Also, content named in “/boot/extlinux/extlinux.conf” is accepted if fuses are not burned, but rejected if fuse is burned. You could likely tell from a serial console boot log based on what it says about extlinux.conf parameters (or what it does not allow to be read), but for a board about to be flashed in recovery mode (or one fully booted) I don’t know how to answer that.

It appears that the encryption fuses are not burned on my AGX module. That’s how they come from the factory, and we’ve certainly never done anything to burn them. So I will put this question to bed for now.

Here’s a more pressing question.

We’ve gotten our kernel image and modified device trees flashed onto the AGX now. When we boot the AGX with our kernel, on our carrier board, we see the start of a normal boot sequence on the USB serial debug console. So, progress!

[0000.085] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.094] I> MB1 (prd-version: 1.5.1.7-t194-41334769-98030a79)
[0000.099] I> Boot-mode: Coldboot
[0000.102] I> Chip revision : A02P
[0000.105] I> Bootrom patch version : 15 (correctly patched)
[0000.110] I> ATE fuse revision : 0x200
[0000.114] I> Ram repair fuse : 0x0
[0000.117] I> Ram Code : 0x2
[0000.120] I> rst_source : 0x0
[0000.122] I> rst_level : 0x0
[0000.126] I> Boot-device: eMMC
[0000.141] I> sdmmc DDR50 mode
[0000.145] W> No valid slot number is found in scratch register
[0000.150] W> Return default slot: _a
[0000.154] I> Active Boot chain : 0
[0000.157] I> Boot-device: eMMC
[0000.160] W> MB1_PLATFORM_CONFIG: device prod data is empty in MB1 BCT.
[0000.168] I> Temperature = 39000
[0000.171] W> Skipping boost for clk: BPMP_CPU_NIC
[0000.175] W> Skipping boost for clk: BPMP_APB
[0000.180] W> Skipping boost for clk: AXI_CBB
[0000.183] W> Skipping boost for clk: AON_CPU_NIC
[0000.188] W> Skipping boost for clk: CAN1
[0000.191] W> Skipping boost for clk: CAN2
[0000.195] E> MB1_PLATFORM_CONFIG: pad voltage config table is empty in MB1 BCT.
[0000.202] E> MB1_PLATFORM_CONFIG: pinmux table is empty in MB1 BCT.
[0000.208] I> Boot-device: eMMC
[0000.211] I> Boot-device: eMMC
[0000.220] I> Sdmmc: HS400 mode enabled
[0000.225] I> ECC region[0]: Start:0x0, End:0x0
[0000.229] I> ECC region[1]: Start:0x0, End:0x0
[0000.233] I> ECC region[2]: Start:0x0, End:0x0
[0000.237] I> ECC region[3]: Start:0x0, End:0x0
[0000.241] I> ECC region[4]: Start:0x0, End:0x0
[0000.245] I> Non-ECC region[0]: Start:0x80000000, End:0x100000000
[0000.251] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.256] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.260] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.265] I> Non-ECC region[4]: Start:0x0, End:0x0
[0000.270] E> FAILED: Thermal config
[0000.275] E> I2C: Timeout while polling for transfer complete. Last value 0x00000002.
[0000.282] E> I2C: Could not write 0 bytes to slave: 0x0078 with repeat start true.
[0000.289] E> I2C_DEV_BASIC: Failed to send register address 0x53.
[0000.295] E> I2C_DEV_BASIC: Could not read data of size 2 at register address 0x0053 from slave 0x78 via i
[0000.304] E> I2C_DEV_BASIC: Failed to update 2 byte value at register 0x53 of slave 0x78 via instance 4
[0000.313] C> NONE: Failed to update reg address 0x53 of slave 0x78 in i2c block :0 in pad voltage config table.
[0000.322] E> FAILED: Generic i2c config
[0000.329] E> FAILED: MEMIO rail config
[0000.347] I> Boot-device: eMMC
[0000.356] I> sdmmc bdev is already initialized
[0000.429] I> MB1 done

This looks identical to the boot console output from the exact same AGX, with the exact same kernel, on our Auvidea X220 carrier board, with one exception: the MB1_PLATFORM_CONFIG errors at timestamps [0000.195] and [0000.202]. These two errors make us suspect that something is wrong with our generated pinmux and padvoltage config files, since those are the only two files we changed.
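
For anyone repeating this kind of side-by-side comparison, a small sketch like the one below can help. It assumes each console capture was saved to a text file (the file names in the usage comment are hypothetical) and that every line starts with a bracketed timestamp like the ones shown above.

```shell
#!/bin/sh
# Sketch: make two serial-console captures comparable by dropping the leading
# "[0000.085] " or "[    8.460802] " timestamp from every line.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//' "$1"
}

# Print the lines that differ between two logs after stripping timestamps.
# Returns diff's exit status: 0 if identical, 1 if they differ.
diff_boot_logs() {
    a=$(mktemp) && b=$(mktemp) || return 1
    strip_ts "$1" > "$a"
    strip_ts "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Usage (hypothetical file names):
#   diff_boot_logs x220_boot.log custom_boot.log
```

Lines that carry measured values, such as the temperature reading, will still show up as differences, so the output needs a quick manual pass.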

The next errors we see are on timestamp [0000.275], involving a missing I2C device.
This suggests to me that our carrier board is missing some I2C device that our kernel expects. But what I2C device? I’ve grepped the entire kernel_src tree for the error messages in the console boot log (“Timeout while polling for transfer complete”, “Failed to send register address”, etc.), and have not found them.

So this is where we are now stuck. We’d really appreciate some thoughts/suggestions for how to attack these questions: what’s wrong with our pinmux and padvoltage files? (They don’t look obviously empty or corrupted.) And what I2C part is our carrier hardware missing, based on the error messages in the serial console output?

Pads and power setup involves device tree. I cannot say anything about your specific error, but the device tree is also related to i2c routing (though sometimes i2c is used to access a ROM not existing on all carrier boards). It is highly likely that at least some of the issues are due to required device tree modifications (which is what the “pinmux” table is about, and the PINMUX spreadsheet tool). The PINMUX describes your carrier board, and will change from default as soon as the layout of the carrier changes compared to what the device tree was originally written for.

Do note that there is more than one way to specify device tree load, e.g., from a partition versus from a file (file requires the FDT attribute in extlinux.conf to be set, partition requires the signed dtb be flashed to a partition; usually one develops by file, and then flashes to partition when it is ready). One can see a reflection of the existing device tree in “/proc/device-tree”. A source code version of the running device tree can be created with:
dtc -I fs -O dts -o extracted.dts /proc/device-tree
(if you don’t have dtc on your Jetson, then you can run “sudo apt-get install device-tree-compiler”)
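
To see which of the two load paths a given system is set up for, a quick check of extlinux.conf might look like this. It is a sketch only: fdt_entries is a hypothetical helper name, and the only assumption about the file format is that a device tree file is named on a line beginning with the FDT keyword, as described above.

```shell
#!/bin/sh
# Sketch: report whether an extlinux.conf names a device tree file via FDT.
# fdt_entries is a hypothetical helper; commented-out lines are ignored
# because they start with "#", not with FDT.
fdt_entries() {
    grep -E '^[[:space:]]*FDT[[:space:]]+' "$1"
}

conf=/boot/extlinux/extlinux.conf
if [ -f "$conf" ]; then
    fdt_entries "$conf" \
        || echo "no FDT entry found; the device tree is then presumably read from the partition"
fi
```

Running the same check against Linux_for_Tegra/rootfs/boot/extlinux/extlinux.conf on the host shows what the next flash would put on the target.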

Auvidea supplies their own flash software, much of which is device tree, while other parts are often the same as the NVIDIA default dev kit software.

Also, note that there may be cases where there is an i2c query by default to discover if something is present. In this case with a third party carrier board, then perhaps it is actually something missing, but that might be due to just the device tree being incorrect and thus not being able to find the hardware. There might be other cases where the i2c query not returning anything is ok. An example of an “ok” query is that HDMI uses i2c on its DDC wire to query the monitor for its capabilities, and if the HDMI is not plugged in, then there would be a failed i2c query (well, in reality it wouldn’t happen because HDMI has a hot plug detect, and so it wouldn’t query a missing monitor unless there were a hard wired HDMI solution).

Thanks @linuxdev . We’ve actually gotten quite a bit farther now. The above problem (with our pinmux and padvoltage files) went away after a clean rebuild and reflash of our AGX. Now, our AGX gets much farther when booting on our custom carrier board.

Interestingly, when we plug the exact same AGX, flashed with exactly the same firmware, into our X220 carrier, it boots all the way to a login prompt on the micro-USB-serial console. We’re able to compare the console output from the X220 boot-up to the console output on our own board. Here is what we see now, with our AGX plugged into the X220. This is the very end of the boot output; right after this, we get a login prompt:

[    8.460802] Root device found: mmcblk0p1
[    8.462667] Found dev node: /dev/mmcblk0p1
[    8.487034] EXT4-fs (mmcblk0p1): recovery complete
[    8.487134] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[    8.489756] Rootfs mounted over mmcblk0p1
[    8.507238] Switching from initrd to actual rootfs
[    8.594786] systemd[1]: System time before build time, advancing clock.
[    8.605540] cgroup: cgroup2: unknown option "nsdelegate"
[    8.611036] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[    8.611934] systemd[1]: Detected architecture arm64.
[    8.617233] systemd[1]: Set hostname to <eoi-host>.
[    8.712275] systemd[1]: File /lib/systemd/system/systemd-udevd.service:35 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[    8.712573] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[    8.827656] random: systemd: uninitialized urandom read (16 bytes read)
[    8.830296] systemd[1]: Created slice User and Session Slice.
[    8.830776] random: systemd: uninitialized urandom read (16 bytes read)
[    8.830929] systemd[1]: Reached target Swap.
[    8.831330] random: systemd: uninitialized urandom read (16 bytes read)
[    8.831936] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[    8.833245] systemd[1]: Created slice System Slice.
[    8.833711] systemd[1]: Listening on RPCbind Server Activation Socket.

And here is what we see with the exact same AGX plugged into our own custom carrier board:

[   23.627650] Root device found: mmcblk0p1
[   23.629383] Found dev node: /dev/mmcblk0p1
[   23.653460] EXT4-fs (mmcblk0p1): recovery complete
[   23.653563] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[   23.656075] Rootfs mounted over mmcblk0p1
[   23.673445] Switching from initrd to actual rootfs
[   23.759049] systemd[1]: System time before build time, advancing clock.
[   23.770142] cgroup: cgroup2: unknown option "nsdelegate"
[   23.775772] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[   23.776731] systemd[1]: Detected architecture arm64.
[   23.782784] systemd[1]: Set hostnaÌ

The boot process hangs up here, right near the very end. We’ve passed through the earlier boot phases (micro boot 1, micro boot 2, cboot, etc.) without any trouble. It looks like one of the systemd services is locking up, for some unknown reason, on our board. (But which one?)

Here is one speculation:

There is an apply_binaries.sh script which appears to copy drivers and device tree files into the kernel subdirectories of the rootfs that gets flashed to the AGX. Perhaps we are failing to run that, or somehow running it incorrectly before flashing the AGX. That might result in kernel files on the rootfs which do not correspond to our custom board’s layout. I’ve read through apply_binaries.sh a number of times and still do not entirely understand what it does. I do know that Auvidea wants us to run that script after copying their own kernel files into the top-level Linux_for_Tegra directory. Any enlightenment here would be helpful.

Thanks again.

The “EXT4-fs (mmcblk0p1): recovery complete” line means the system was shut down improperly, and perhaps there is now missing file system content. ext4 is a journal-based file system, and it can prevent corruption by replaying the last writes which were not sync’d and removing those, but this does not mean the content is saved. This might or might not be related to your issues; it is hard to say (most of the time it won’t be an issue if the journal does the recovery, but it could be). This is just one of those wild cards you can’t be certain about if shutdown is not correctly performed, e.g., if power is cut or if there is some sort of lock-up.

The sample rootfs is purely Ubuntu (18.04 for most releases, the latest JetPack 5 developer preview is for Ubuntu 20.04), and thus licensing is unmodified for distributing this. The end user is the one who runs “apply_binaries.sh” (automatically from JetPack/SDK Manager, or manually if manually installing), and as you say, this installs NVIDIA-specific drivers and software (basically direct hardware access content). This only needs be done once. This won’t be changing anything related to what you are doing.

What will have an effect is how the rootfs image is generated. Mostly that image is a copy of the “Linux_for_Tegra/rootfs/” content, but some files in “rootfs/boot/” will change depending on arguments passed to the flash software.

It is easiest to explain based on command line flash with “flash.sh”, but the process is the same when run through the GUI. An example command line flash might be:
sudo ./flash.sh jetson-xavier mmcblk0p1

The “jetson-xavier” refers to the config file “jetson-xavier.conf”. This in turn mentions other config based on some particular carrier board. Based on this being an AGX, and based on a particular carrier board, the kernel Image file and device tree may be changed in “rootfs/boot/” prior to creating the rootfs image. Depending on boot options the "rootfs/boot/extlinux/extlinux.conf" may also be changed. Once those are in place the partition image is created as “Linux_for_Tegra/bootloader/system.img.raw” (and the sparse version, “bootloader/system.img”). This contains those updated kernel, device tree, and extlinux.conf files.
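
One way to see which of those files a flash run actually touched is to checksum them before and after. The sketch below assumes it is run from the Linux_for_Tegra directory and that sha256sum (GNU coreutils) is available on the host; boot_manifest is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: checksum the kernel Image, device trees and extlinux.conf found
# under <rootfs>/boot, sorted by path so two runs can be diffed.
# boot_manifest is a hypothetical helper name.
boot_manifest() {
    find "$1/boot" -type f \
        \( -name Image -o -name '*.dtb' -o -name extlinux.conf \) \
        -exec sha256sum {} + | sort -k 2
}

# Usage (not run automatically; sudo may be needed to read the rootfs tree):
#   boot_manifest rootfs > before.txt
#   sudo ./flash.sh jetson-xavier mmcblk0p1 2>&1 | tee log_flash.txt
#   boot_manifest rootfs > after.txt
#   diff before.txt after.txt
```

Any line that changes between the two manifests is a file the flash software replaced in “rootfs/boot/” before building the image.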

When the flash decides to copy the content in it will copy a reference version of various files to either the “bootloader/” or “kernel/” directories, and then copy that file into “rootfs/”. To know which one is copied it is easiest to just log a command line flash and read the logs. An example is:
sudo ./flash.sh jetson-xavier mmcblk0p1 2>&1 | tee log_flash.txt

You already know about the custom .cfg file since you are using that, but be aware that you can customize the sub-components which are copied as well (e.g., you can make your own reference copy of content with a new name…I have not done so myself, but that is the purpose of separation of carrier board config and module config into human-readable config files). I do not know if perhaps flash put some default file in which stops boot from completing or not, but you could flash on command line using your config file and look at the logs to see if the wrong content was copied or not. If not, then you know to edit your content.

There might also be some question of whether the initrd works with your setup. I could not say, but perhaps the logs will provide a hint as to how the initrd was created (perhaps it failed to use your device tree, though likely that isn’t a problem; it does make a good example of what can go wrong).

Note that if the initrd was running, then the Linux kernel was running in a limited RAM-disk system, and that this must have at least partially succeeded because the ext4 file system was repaired. Perhaps the repair is why it doesn’t work? Don’t know. However, the initrd does successfully complete its job and then pivot_root to the mmcblk0p1 partition. This is when it goes wrong. This could be because the initrd did not set up properly before pivot_root, or it could be because the ext4 file system on mmcblk0p1 is not valid based on what was passed.

Incidentally, the kernel is process ID (PID) 0. The kernel technically is running only a single program, that program being “init”. Systemd is the part of init which brings up the systems in a “somewhat” object-oriented way (I consider systemd most of init, although in older systems this was just a bunch of bash shell script files). If you can’t start systemd (init), then the system panics and cannot continue. Looks like systemd partially started since it detected arm64, but nothing else continues…basically init is dying almost instantly upon pivot_root to the eMMC.

I would not typically expect a simple journal-based file system repair to cause init to fail almost instantly. This failure is in fact likely the reason the system was not shut down cleanly…there was probably no chance of that. So I suspect something was incorrect about the rootfs or the initrd, but since the partition was found, and since the pivot_root is eMMC, I suspect that either it is the rootfs or the device tree at issue (init can’t work very well if it goes to use hardware and the device tree causes the hardware to be missing).
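
If the question is which part of systemd stalls, one option is to ask systemd for more console output through its kernel command line options systemd.log_level and systemd.log_target. The sketch below only prints a modified copy of an extlinux.conf and leaves the original alone. It assumes the kernel arguments live on a line starting with APPEND; add_systemd_debug is a hypothetical helper, and whether the flash software regenerates this file is something I have not checked.

```shell
#!/bin/sh
# Sketch: print an extlinux.conf with systemd debug logging appended to every
# APPEND line. The input file is not modified.
# add_systemd_debug is a hypothetical helper name.
add_systemd_debug() {
    sed -E '/^[[:space:]]*APPEND[[:space:]]/ s/$/ systemd.log_level=debug systemd.log_target=console/' "$1"
}

# Usage (not run automatically): review the output before installing it
#   add_systemd_debug rootfs/boot/extlinux/extlinux.conf > extlinux.conf.debug
```

With debug logging on the serial console, the last unit mentioned before the hang is a reasonable first suspect.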

Thanks @linuxdev for the detailed explanation. The final boot problem turned out to be hardware-related. The culprit was not a corrupted filesystem or an incorrectly-configured kernel/device tree.

Anyhow, we are able to boot the AGX all the way to a console login prompt on our custom carrier board now. There may be other problems ahead … but this is a big milestone for us. Have a great weekend.