Failed to initialize the NVIDIA graphics device!

Hello. I have reviewed the other similar topics, but so far, I am unable to find a solution to this problem. My Jetson system boots up, but will not automatically go into graphics mode. I cannot get it to enter graphics mode, no matter what I do. I will attach the results of nvidia-bug-report-tegra.sh. Thanks in advance, your help is greatly appreciated!
nvidia-bug-report-tegra.log (17.6 MB)

Hi,
Which release are you using? And did you flash the system image through SDK Manager?

Thank you for your reply. We are using JetPack 4.6 (Jetson Xavier NX), with kernel 4.9 synced to Linux For Tegra (L4T) tag tegra-l4t-r32.6.1 (for both kernel and uboot). The toolchain is gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.

And yes, we do use the SDK Manager (1.8.2.10409) to flash the system, although the flash is done from the command line, not from the SDK Manager GUI, as follows:

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra
sudo ./flash.sh target mmcblk0p1

I have also tried using JetPack 4.6.1 (Jetson Xavier NX), with kernel 4.9 synced to Linux For Tegra (L4T) tag tegra-l4t-r32.7.1 (for both kernel and uboot). Same results.

Thanks again in advance for any tips or pointers. I have been working hard on this for a while now trying to find a solution, and so far it has been elusive.

Your log keeps spewing a lot of interrupt messages from mmc0. This looks abnormal.

Could you attach the “dmesg” output from your Jetson directly instead of using the bug report?

kern  :debug : [ 1663.991165] <mmc0: starting CMD23 arg 00000018 flags 00000015>
kern  :debug : [ 1663.991168] mmc0: starting CMD18 arg 0071fbb8 flags 000000b5
kern  :debug : [ 1663.991171] mmc0:     blksz 512 blocks 24 flags 00000200 tsac 100 ms nsac 0
kern  :debug : [ 1663.991173] mmc0:     CMD12 arg 00000000 flags 00000095
kern  :debug : [ 1663.991202] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
kern  :debug : [ 1663.991560] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
kern  :debug : [ 1663.991573] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1663.991577] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
kern  :debug : [ 1663.991579] mmc0:     12288 bytes transferred: 0
kern  :debug : [ 1663.991581] mmc0:     (CMD12): 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1664.006164] sdhci-tegra 3400000.sdhci: Setting clk limit 0
kern  :debug : [ 1664.006173] sdhci-tegra 3400000.sdhci: Disabling clk 0, clk enabled 1
kern  :debug : [ 1665.834263] sdhci-tegra 3400000.sdhci: Setting clk limit 208000000
kern  :debug : [ 1665.834274] sdhci-tegra 3400000.sdhci: Enabling clk 208000000, clk enabled 0
kern  :debug : [ 1665.835735] sdhci-tegra 3400000.sdhci: req clk 208000000, set clk 195250195
kern  :debug : [ 1665.836781] <mmc0: starting CMD23 arg 00000020 flags 00000015>
kern  :debug : [ 1665.836784] mmc0: starting CMD18 arg 01114828 flags 000000b5
kern  :debug : [ 1665.836786] mmc0:     blksz 512 blocks 32 flags 00000200 tsac 100 ms nsac 0
kern  :debug : [ 1665.836788] mmc0:     CMD12 arg 00000000 flags 00000095
kern  :debug : [ 1665.836821] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
kern  :debug : [ 1665.837509] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
kern  :debug : [ 1665.837541] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1665.837546] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
kern  :debug : [ 1665.837550] mmc0:     16384 bytes transferred: 0
kern  :debug : [ 1665.837554] mmc0:     (CMD12): 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1665.838026] <mmc0: starting CMD23 arg 00000068 flags 00000015>
kern  :debug : [ 1665.838029] mmc0: starting CMD18 arg 01114848 flags 000000b5
kern  :debug : [ 1665.838030] mmc0:     blksz 512 blocks 104 flags 00000200 tsac 100 ms nsac 0
kern  :debug : [ 1665.838032] mmc0:     CMD12 arg 00000000 flags 00000095
kern  :debug : [ 1665.838057] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000001
kern  :debug : [ 1665.838853] sdhci [sdhci_irq()]: *** mmc0 got interrupt: 0x00000002
kern  :debug : [ 1665.838887] mmc0: req done <CMD23>: 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1665.838892] mmc0: req done (CMD18): 0: 00000900 00000000 00000000 00000000
kern  :debug : [ 1665.838905] mmc0:     53248 bytes transferred: 0
kern  :debug : [ 1665.838909] mmc0:     (CMD12): 0: 00000000 00000000 00000000 00000000
kern  :debug : [ 1665.854157] sdhci-tegra 3400000.sdhci: Setting clk limit 0
kern  :debug : [ 1665.854174] sdhci-tegra 3400000.sdhci: Disabling clk 0, clk enabled 1

Hi Wayne,

Thank you for taking a look. Per your request, I have captured the dmesg output and attached file “dmesg.txt” to this reply.

When I search this file for HDMI, I see numerous ENABLE and DISABLE messages, followed by something related to I2C. This catches my eye and reminds me of a post from 07/13/21 by @linuxdev in topic “My nx cannot enter the desktop system”, as follows:

If your device tree does not properly power and set up i2c for this particular carrier board, then it is not possible for EDID to succeed, and also not possible for the X server to work.

Do you think the comment above could possibly be relevant to my situation? If not, please disregard.

Thanks,
Don
dmesg.txt (223.9 KB)

Hi Wayne,

Just trying to share a little more information with you … I think it’s worth noting here that the graphics adapter on my Jetson is capable of making my monitor work in graphical mode. We know that because after building the OS and flashing the Jetson, the first boot comes up in graphical mode for the purpose of accepting the license agreement, selecting the time zone, specifying the user account, etc. That stuff is all done in graphical mode.

But when that process is done, and the system restarts, it will never successfully come up in graphical mode again. I know it is trying to run the X server, and it is using the gdm3 display manager, but the real problem appears to be that the graphics driver will not start (for reasons that I do not yet understand). My assumption is that the graphics adapter driver is “nvgpu”.

Anticipating some questions that you may ask, based on other similar topics:

1.) When I do an “lsmod” command, I do not see the “nvgpu” module listed.

jetson@jetson-desktop:~$ lsmod
Module                  Size  Used by
zram                   25920  6

2.) But the “nvgpu.ko” module is present in the rootfs, as follows:

jetson@jetson-desktop:~$ ls -al /lib/modules/4.9.253-tegra/kernel/drivers/gpu/nvgpu/
total 2376
drwxr-xr-x 2 root root    4096 Sep 21  2022 .
drwxr-xr-x 4 root root    4096 Sep 21  2022 ..
-rw-r--r-- 1 root root 2423152 Jul 26  2021 nvgpu.ko

3.) If I go to that directory and do a “sudo insmod nvgpu.ko”, it says this:

jetson@jetson-desktop:/lib/modules/4.9.253-tegra/kernel/drivers/gpu/nvgpu$ sudo insmod nvgpu.ko
insmod: ERROR: could not insert module nvgpu.ko: Invalid parameters

4.) And I then see the following new messages appear in the dmesg log that are coming from the nvgpu module:

[  618.318870] nvgpu: disagrees about version of symbol dev_warn
[  618.322562] nvgpu: Unknown symbol dev_warn (err -22)
[  618.369650] nvgpu: disagrees about version of symbol __dynamic_dev_dbg
[  618.372765] nvgpu: Unknown symbol __dynamic_dev_dbg (err -22)
[  618.375471] nvgpu: disagrees about version of symbol wake_up_process
[  618.377794] nvgpu: Unknown symbol wake_up_process (err -22)
[  618.380117] nvgpu: disagrees about version of symbol device_show_int
[  618.382446] nvgpu: Unknown symbol device_show_int (err -22)
[  618.386039] nvgpu: disagrees about version of symbol device_create_file
[  618.388396] nvgpu: Unknown symbol device_create_file (err -22)
[  618.390781] nvgpu: disagrees about version of symbol perf_trace_run_bpf_submit
[  618.393163] nvgpu: Unknown symbol perf_trace_run_bpf_submit (err -22)
[  618.396140] nvgpu: disagrees about version of symbol device_create
[  618.407490] nvgpu: Unknown symbol device_create (err -22)
[  618.410021] nvgpu: disagrees about version of symbol dev_err
[  618.412308] nvgpu: Unknown symbol dev_err (err -22)
[  618.416672] nvgpu: disagrees about version of symbol device_destroy
[  618.418986] nvgpu: Unknown symbol device_destroy (err -22)
[  618.423473] nvgpu: disagrees about version of symbol device_remove_file
[  618.425840] nvgpu: Unknown symbol device_remove_file (err -22)

5.) It looks like the error message in the subject line comes from “nvidia_drv.so”, as shown below:

jetson@jetson-desktop:/usr/lib/xorg$ grep -r "Failed to initialize the NVIDIA graphics device!" *
Binary file modules/drivers/nvidia_drv.so matches

Hope this helps!

Thanks,
Don
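
The “disagrees about version of symbol” lines above are what the kernel prints when the symbol checksums recorded in a module (CONFIG_MODVERSIONS) do not match the running kernel. That usually points at a module built against a different kernel build than the one that booted. As a rough sketch, the helper below pulls the affected symbol names out of dmesg text so they are easier to compare between boots. The function name `list_mismatched_symbols` is made up for illustration, and the demo input is copied from the log above.

```shell
# Print the unique symbol names the kernel reported as version-mismatched.
# Reads kernel log text on stdin.
list_mismatched_symbols() {
    sed -n 's/.*disagrees about version of symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Typical use on the target (assumes dmesg still holds the failed insmod):
#   dmesg | list_mismatched_symbols

# Demo with three lines copied from the log above:
list_mismatched_symbols <<'EOF'
[  618.318870] nvgpu: disagrees about version of symbol dev_warn
[  618.322562] nvgpu: Unknown symbol dev_warn (err -22)
[  618.410021] nvgpu: disagrees about version of symbol dev_err
EOF
```

If the list is dominated by core kernel symbols such as `dev_err` or `wake_up_process`, as it is here, the mismatch is more likely between the module and the kernel image as a whole than a problem in one subsystem.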

Hi,
Could you try the latest JetPack 4.6.2? The Xavier NX developer kit should work well with the default system image. Xavier NX modules come in several hardware versions. Yours is probably the newest one and may not be supported in previous releases.

Some clarification here.

  1. Please tell us whether you are using the NVIDIA developer kit (devkit) or a custom board. If this is a devkit, then the board should work after you flash it with SDK Manager. There is no need to change anything else.

  2. Did you ever rebuild the kernel or anything else yourself? I am just curious about those mmc prints in your dmesg. They shouldn’t appear if you use the default image from SDK Manager.

1.) I am not having a problem with the devkit. The devkit board works fine. The problem is when using my client’s custom board.

2.) Yes, we are re-building the kernel. As far as I can tell, we don’t change a single line of C code anywhere in the kernel, but we do replace the entire hardware device tree with their custom device tree settings.

3.) I tried JetPack 4.6.2 (Jetson Xavier NX), with kernel 4.9 synced to Linux For Tegra (L4T) tag tegra-l4t-r32.7.1 (for both kernel and uboot), but the result was the same.

Oh ok. Got your situation now.

What is the status of “lsmod”? Is it still empty? The nvgpu driver is built as a kernel module, so if it is not loaded, then your graphics device will be unavailable (as the “gpu” driver is missing…).

And lsmod would be empty if you missed some steps when building the kernel.

Yes, “lsmod” is still empty. No modules are listed. I used to see “zram” listed, but that one is gone now, too. Any idea which steps might be missing when building the kernel? I have been through the log and I don’t see anything obvious that seems to be broken with the build.

Just some related comments, not an answer:

  • If there is a new kernel, then it is possible the output of “uname -r” will differ. In that case the kernel modules will need to be in a new location. What do you see for “uname -r”? If you cd here, what do you see (you might need to “sudo apt-get install tree”, or use a similar command)?
cd /lib/modules/$(uname -r)/kernel
# This should show a lot of files:
tree
  • A third party carrier board, if not an exact match to the layout of the dev kit, will need edits to the device tree. Any number of components might fail to work correctly without such edits. The i2c power for query of a monitor’s plug-n-play over the DDC wire is just one example.

Thank you for your reply, @linuxdev. The kernel modules appear to be in the correct location.

root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson# cat /proc/version
Linux version 4.9.253-tegra (root@aed-lab-1) (gcc version 7.5.0 (Linaro GCC 7.5-2019.12) ) #1 SMP PREEMPT Sun Sep 25 11:10:16 MDT 2022
root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson# uname -r
4.9.253-tegra
root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson# cd /lib/modules/4.9.253-tegra/kernel
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel#
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel#
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel#
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel# ls -al
total 76
drwxr-xr-x  8 root root  4096 Sep 25 14:13 .
drwxr-xr-x  3 root root  4096 Sep 25 13:13 ..
drwxr-xr-x  2 root root  4096 Sep 25 13:13 crypto
drwxr-xr-x 30 root root  4096 Sep 25 13:12 drivers
drwxr-xr-x  8 root root  4096 Sep 25 13:13 fs
drwxr-xr-x  3 root root  4096 Sep 25 13:13 lib
drwxr-xr-x 16 root root  4096 Sep 25 13:12 net
drwxr-xr-x  5 root root  4096 Sep 25 13:13 sound
-rw-r--r--  1 root root 42589 Sep 25 14:09 tree.txt
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel#
root@jetson-desktop:/lib/modules/4.9.253-tegra/kernel#

Regarding your second comment, I suspect you are correct. We probably have a faulty device tree file somewhere. But how do we find the bug?

Here’s a little more background. A previous engineer at this company was working on this project and was actively making edits to the device tree files just before he left. A Subversion (SVN) repo was being used to track the changes. I have taken over where he left off, and I am trying to bring the board up using the documented build procedure. Using the current state of the repo, the board boots up fine, and I am able to connect to it via SSH over the USB port. The network is functional, so I can connect to the network for the purpose of installing things (e.g. “tree”). What is NOT working is the graphical interface. That’s where I am stuck. I’m having difficulty figuring out why no modules are loading, in particular, the “nvgpu” module. I think this is the one that is preventing the X server from coming up. I am not new to Linux kernel mode development, but I am new to device tree customization.

We boot the board from an SD card. My objective has been to create an SD card that allows us to boot the board up into graphical mode. The SD card that I am able to create seems to work well in most respects, except for the fact that it does not boot up into graphical mode. That’s the problem I am trying to fix.

We have an existing SD card that does work correctly that was left for us by the previous engineer. When I use that SD card, the board boots up into graphical mode. We don’t know precisely how he made this SD card. We assume it was made using the documented build procedure (which I am also following), but we cannot be 100% sure of that, since we don’t have a build log that tells us exactly how he created that SD card.

A question for you and the others reading this, including @WayneWWW and @DaneLLL, please … is there any way I can extract the contents of the device tree from the “good” card (i.e. the one that boots up correctly into graphical mode) and use that info to compare against the same device tree info from my “bad” card (i.e. the one that does not boot up into graphical mode)? My thought is that if I knew precisely what the device tree differences were between the two SD cards, then that might lead me to make the correct changes to my broken device tree and fix the problem once and for all.

Another note, when I boot up using the “good” card, I do see a bunch of modules (including “nvgpu”) listed when I do an “lsmod”. Also, I have extracted the kernel configuration from the good card and compared it to the one I am using, and they are identical, so that is not the problem.

Thank you for hanging in there with me! I need to find a solution soon.

Best,
Don
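
On the question of comparing the device trees from the “good” and “bad” cards: the running kernel exposes the tree it booted with under /proc/device-tree, so one possible approach is to dump it to source form on each boot and diff the two dumps. The sketch below assumes `dtc` is available on the target (on Ubuntu it ships in the device-tree-compiler package). The function names and the file names in the usage comments are hypothetical.

```shell
# Dump the live device tree of the currently booted system to a .dts file.
# -I fs reads the /proc/device-tree directory, -s sorts nodes and properties
# so that two dumps line up in a diff. Needs root to read every property.
# $1 = output .dts path
dump_live_dt() {
    dtc -I fs -O dts -s -o "$1" /proc/device-tree
}

# Print a unified diff of two dumps. Returns diff's exit status
# (0 = identical, 1 = differences found).
# $1 = dump from the good card, $2 = dump from the bad card
compare_dt() {
    diff -u "$1" "$2"
}

# Usage (run the dump once per card, then copy both files to one machine):
#   dump_live_dt /tmp/good.dts      # while booted from the good card
#   dump_live_dt /tmp/bad.dts       # while booted from the bad card
#   compare_dt good.dts bad.dts | less
```

Expect some noise in the diff, for example phandle numbers and boot arguments that differ between builds. Differences in `status`, `compatible`, regulator, and i2c nodes are probably the more interesting ones to look at first.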

I can’t give you a direct answer, but I can offer some information which might make finding the problem easier.

  • Consider that if a device has an integrated driver, then even if there is a module of the same driver, no module will load.
  • If there is a module, and no integrated driver, then the driver must be told there is a reason to load, or else it must be loaded manually (e.g., “sudo insmod <path to driver module file>”).
  • In the case of a plug-n-play device, e.g., PCIe or USB, then the method of telling the driver it should try to load is automatic.
  • For all other devices the driver must be nudged. Typically, the device tree is how this would occur, although one can still manually “insmod” the device. However, if the device has any error during a manual insmod, then it still won’t load.
  • Often the first thing a device tree entry does is to name the device category and a physical address which the device responds to. The “compatible” entry selects all such drivers which are allowed to attempt to work with this device.
  • Should the device tree entry point to the wrong physical address, then the device cannot load, and you will get a failed attempt to load (contrast this with a missing attempt to load, whereby the device did not make an attempt due to not knowing it should try). Does a manual insmod work? Does it fail with an error? It might be a clue to the difference of whether the detection is missing, or if the tree is passing bogus information the driver uses during load.
  • It is concerning that no modules load. If you have a reference from a working unit, e.g., a serial console boot log of the same software booting and working on a dev kit (or an lsmod output from the working system), then you could look at each module which loads. You could then manually try to “insmod” particular modules…the nature of the error might offer a clue.
  • Consider that a device might not respond if:
    • The device has a bus which is not powered.
    • The device is at the wrong address.
    • Device load requires specific arguments for the driver via device tree at the moment of load, and the arguments are missing or incorrect.
    • For the case of “not powered”, consider that many devices share the same power bus; if the bus itself is what fails, then all devices on that bus will fail.
    • If you have a plug-n-play device which uses a module driver, e.g., something USB, then you know that no device tree entry is needed, but if such a device fails to attempt to load the module, the cause is different than if the device errors trying to load. If the error is to fail during an attempt to load, consider that it might be some other common requirement, such as the above mentioned failure to have a power bus available. Actually, if the plug-n-play device depends on a power bus before the automatic query can work, then it too might not even try to load. How could the hot plug system know something connected if that something can’t say it is there?
  • In some rare case there might also need to be firmware available. This is most typical of WiFi cards since regulations around the world differ, and it is easier to load firmware for each region than it is to manufacture a different card for each region. Failure to install firmware for such a case means the API is wrong, and the driver will error out making calls to the wrong API. Not likely in your case since you have no modules at all loading.
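
The first two bullets (integrated driver versus loadable module) can be checked from a shell. A minimal sketch, assuming the usual module tree layout in which the kernel build installs a `modules.builtin` list next to the modules. `driver_kind` is a hypothetical helper name, not an existing tool.

```shell
# Report whether a driver is built into the kernel, available as a loadable
# module file, or absent from the module tree.
# $1 = driver name (e.g. nvgpu)
# $2 = module tree to inspect (default: the running kernel's tree)
driver_kind() {
    name=$1
    moddir=${2:-/lib/modules/$(uname -r)}
    if grep -q "/${name}\.ko\$" "$moddir/modules.builtin" 2>/dev/null; then
        echo builtin
    elif find "$moddir" -name "${name}.ko" 2>/dev/null | grep -q .; then
        echo module
    else
        echo absent
    fi
}

# Usage on the target:
#   driver_kind nvgpu
#   lsmod | grep nvgpu      # separately: is it actually loaded right now?
```

A result of `module` together with an empty lsmod means the file exists but nothing loaded it successfully, which is the situation described in this thread.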

What error did you get if you modprobe nvgpu?

@WayneWWW, here you go:

jetson@jetson-desktop:~$
jetson@jetson-desktop:~$ modprobe nvgpu
modprobe: ERROR: could not insert 'nvgpu': Operation not permitted
jetson@jetson-desktop:~$
jetson@jetson-desktop:~$ su
Password:
root@jetson-desktop:/home/jetson#
root@jetson-desktop:/home/jetson# modprobe nvgpu
modprobe: ERROR: could not insert 'nvgpu': Exec format error
root@jetson-desktop:/home/jetson#

Thanks,
Don

Exec format error tends to mean that this was compiled for the wrong CPU architecture. Example: Trying to use a desktop PC module on an arm64 Jetson. If one compiles natively on the Jetson, then be sure to not specify “ARCH”.
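
Besides a wrong architecture, the kernel can also refuse a module with this error when the module's “vermagic” string does not match the running kernel. The sketch below compares the release recorded in a module with a kernel release. It assumes the first word of the vermagic string is the kernel release the module was built for. `vermagic_matches` is a hypothetical helper name.

```shell
# Succeed when the kernel release recorded in a module's vermagic string
# equals the given kernel release.
# $1 = vermagic string (output of "modinfo -F vermagic <module.ko>")
# $2 = kernel release (output of "uname -r")
vermagic_matches() {
    module_release=${1%% *}     # keep only the first word of the vermagic string
    [ "$module_release" = "$2" ]
}

# Usage on the target:
#   ko=/lib/modules/$(uname -r)/kernel/drivers/gpu/nvgpu/nvgpu.ko
#   file "$ko"                  # shows the ELF architecture of the module
#   if vermagic_matches "$(modinfo -F vermagic "$ko")" "$(uname -r)"; then
#       echo "release matches"
#   else
#       echo "release mismatch"
#   fi
```

A matching release string is necessary but not sufficient: the remaining words of the vermagic string and the per-symbol checksums also have to agree with the running kernel.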

Hello again, @linuxdev. Thank you for your reply. I am using “ARCH=arm64” when I build everything.

I am building on a desktop running Ubuntu 18.04.6 LTS.

I will paste my build procedure below. Please let me know if you see something peculiar.

Thanks!

Don

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra/sources

TOOLCHAIN_PREFIX=$HOME/l4t-gcc/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-
TEGRA_KERNEL_OUT=$HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra/sources/build
KERNEL_MODULES_OUT=$HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra/sources/modules

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra/sources

# Run menuconfig to make sure it works.  Exit and verify that no changes were made.
sudo make -C kernel/kernel-4.9/ ARCH=arm64 O=$TEGRA_KERNEL_OUT LOCALVERSION=-tegra CROSS_COMPILE=${TOOLCHAIN_PREFIX} menuconfig

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra/sources

# Build the image.
sudo make -C kernel/kernel-4.9/ ARCH=arm64 O=$TEGRA_KERNEL_OUT LOCALVERSION=-tegra CROSS_COMPILE=${TOOLCHAIN_PREFIX} -j$(nproc) Image

# Build the device tree.
sudo make -C kernel/kernel-4.9/ ARCH=arm64 O=$TEGRA_KERNEL_OUT LOCALVERSION=-tegra CROSS_COMPILE=${TOOLCHAIN_PREFIX} -j$(nproc) dtbs

# Build the modules.
sudo make -C kernel/kernel-4.9/ ARCH=arm64 O=$TEGRA_KERNEL_OUT LOCALVERSION=-tegra CROSS_COMPILE=${TOOLCHAIN_PREFIX} -j$(nproc) modules

# Install the modules.
sudo make -C kernel/kernel-4.9/ ARCH=arm64 O=$TEGRA_KERNEL_OUT LOCALVERSION=-tegra INSTALL_MOD_PATH=$KERNEL_MODULES_OUT modules_install

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra

# Copy kernel generated.
cp -rfv ./sources/build/arch/arm64/boot/Image kernel/

# Copy device tree generated.
cp -rfv ./sources/build/arch/arm64/boot/dts/* kernel/dtb/

# Copy new modules.
sudo cp -arfv ./sources/modules/lib rootfs/

cd $HOME/nvidia/nvidia_sdk/JetPack_4.6_Linux_RECORDER_TARGET/Linux_for_Tegra

# Flash the whole image.
sudo ./flash.sh recorder mmcblk0p1

@WayneWWW @DaneLLL @linuxdev

When I try to “insmod nvgpu.ko”, I get “invalid module format”.

Something seems to be wrong with my build, but I don’t know what it is. I seem to have a mismatch between the kernel version I am actually running on the target board, and the kernel version the modules are built for. But again, I don’t know how to fix it.

The kernel I get from JetPack 4.6 is called “4.9”. And yet, when I am up and running on my target board, the kernel version is “4.9.253-tegra”.

I am re-building everything, the OS, the modules, everything. And yet, I seem to have a kernel version mismatch.

I see in my make lines “LOCALVERSION=-tegra”, so I know where that part comes from. But where does the 4.9.253 number come from?

Any ideas how to fix it, please?
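
On where the number comes from: the “4.9.253” part is assembled from the VERSION, PATCHLEVEL and SUBLEVEL values in the kernel's top-level Makefile (“4.9” is just the source directory name), and LOCALVERSION is appended to it. A minimal sketch that reads those values, with `base_kernel_release` as a hypothetical helper name. The real build may append more (for example CONFIG_LOCALVERSION from the config), so the authoritative value is whatever `make ... kernelrelease` prints for the same O=, ARCH= and LOCALVERSION= arguments.

```shell
# Print VERSION.PATCHLEVEL.SUBLEVEL plus EXTRAVERSION from a kernel top-level
# Makefile, followed by an optional LOCALVERSION suffix.
# $1 = path to the top-level Makefile
# $2 = LOCALVERSION suffix (e.g. -tegra), may be empty
base_kernel_release() {
    awk -v suffix="$2" '
        $1 == "VERSION"      && $2 == "=" { v = $3 }
        $1 == "PATCHLEVEL"   && $2 == "=" { p = $3 }
        $1 == "SUBLEVEL"     && $2 == "=" { s = $3 }
        $1 == "EXTRAVERSION" && $2 == "=" { e = $3 }
        END { printf "%s.%s.%s%s%s\n", v, p, s, e, suffix }
    ' "$1"
}

# Usage on the build host, from the sources directory:
#   base_kernel_release kernel/kernel-4.9/Makefile -tegra
#   ls modules/lib/modules/     # directory name should equal the release string
```

Note that a stock kernel and a rebuilt one can report the same release string, so an identical “uname -r” does not by itself prove that the modules in the rootfs were built together with the kernel image that actually booted.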