Error found in TX2 Boot: "nvdc: open: Permission denied" and ***NvRmMemInit failed***

Hi.

This is TX2 NX with a custom carrier and custom compiled kernel.

Recently when booting up our jetsons (it’s happening to more than one) we started getting this message:

After we hit ok and reboot, we then get this one:

After this, we hit ok, reboot and it boots up normally to the desktop.
Clearly something is wrong here and we’re not sure what this means and how to fix it.

Full dmesg attached:
dmesg_out.txt (68.5 KB)

Thank you.

Does anyone know anything that might help? This is a little urgent!

Hi @linuxdev
Maybe you can help me with this since, from what I’m seeing, this is more of a Linux thing than a Jetson thing, I’m guessing. I really have no idea what I did, or what happened, to create this problem.

This is specific to the NVIDIA GPU software. I don’t know enough about that software to be able to answer. About all I can say is that if you can run the following command, then permissions in general are likely correct for regular user space software:
sudo ls
(just testing sudo, not really interested in ls output)

Since you compiled your own kernel, then it is also possible a module related to the display controller is missing or in some way failing to load. I couldn’t say what, but a list of the following modules might be useful:
find /lib/modules/$(uname -r) -iname 'nv*'

Also, what do you see from “lsmod”? Do you have any other kernel configuration which works, and in which you can show a comparison “lsmod” from?

Last, what L4T release is this based on (“head -n 1 /etc/nv_tegra_release”)?
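
For the “good” versus “bad” comparison, a minimal sketch is shown here. It assumes each board's “lsmod” output was first saved to a text file; the function name and file names are hypothetical, not anything from L4T.

```shell
# Hypothetical helper: list module names present in one saved lsmod dump
# but not in the other. Both arguments are placeholder file names.
lsmod_diff() (
    good=$1
    bad=$2
    tmp_good=$(mktemp) && tmp_bad=$(mktemp) || exit 1
    # Skip the "Module Size Used by" header, keep column 1, sort for comm.
    awk 'NR > 1 { print $1 }' "$good" | sort > "$tmp_good"
    awk 'NR > 1 { print $1 }' "$bad"  | sort > "$tmp_bad"
    # comm -23 keeps lines only in the first file, comm -13 only in the second.
    comm -23 "$tmp_good" "$tmp_bad" | sed 's/^/only-in-good: /'
    comm -13 "$tmp_good" "$tmp_bad" | sed 's/^/only-in-bad: /'
    rm -f "$tmp_good" "$tmp_bad"
)
```

Usage would be something like “lsmod > lsmod_good.txt” on one board, “lsmod > lsmod_bad.txt” on the other, then “lsmod_diff lsmod_good.txt lsmod_bad.txt”. This only compares module names, not the “Used by” counts.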

Hi, @linuxdev , thank you for replying.

sudo ls runs fine. So I guess permissions are correct.

We have our custom kernel source in VCS (and it’s not that heavily modified), and I’ve been looking through the changes a lot. I think it’s fair to say that there isn’t any obvious mistake I made, which doesn’t mean there isn’t one.

I’m not sure this is what you were looking for, but here is the output from the Jetson that is running into these issues (I’ll call it the bad Jetson from now on) and from one Jetson that is not running into them (I’ll call it the good Jetson):

find /lib/modules/$(uname -r) -iname 'nv*'

This output is the same for both jetsons

lsmod output from bad jetson:
lsmod_bad_jetson.txt (1.3 KB)

lsmod output from good jetson:
lsmod.txt (1.3 KB)

head -n 1 /etc/nv_tegra_release output:
# R32 (release), REVISION: 6.1, GCID: 27863751, BOARD: t186ref, EABI: aarch64, DATE: Mon Jul 26 19:36:31 UTC 2021

I really have no hints about what this could be, through all the search I’ve been doing.

I see no difference between the “good” and “bad” lsmod commands other than the “bad” one has more modules using nvgpu. Not really an issue so far as I can tell.

Can you post the output of “ls -l /dev/tegra_dc*” on both a “good” and “bad” Jetson? It might be useful to see if permissions actually differ.
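
Besides looking at the “ls -l” listing, access can be tested directly as the user in question. This is only a sketch; the function name is made up for illustration and it checks nothing beyond what the shell's own read/write tests report.

```shell
# Report whether the invoking user can open a given node for read and write.
node_access() {
    if [ ! -e "$1" ]; then
        echo "$1: missing"
    elif [ -r "$1" ] && [ -w "$1" ]; then
        echo "$1: read/write ok"
    else
        echo "$1: permission denied for $(id -un)"
    fi
}
```

On the Jetson this could be run as “for n in /dev/tegra_dc*; do node_access "$n"; done”, once as the desktop user and once under sudo, to see whether the answers differ.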

Hi @linuxdev,

We got the following outputs for both a successful boot (without any errors showing up) and an unsuccessful one:

WhatsApp Image 2022-09-17 at 17.28.49

We would like to know if there is a way to proceed with the boot automatically when this error appears. This is because we have this jetson in a drone and we need it to start automatically without requiring someone to press ‘ok’ to close this window if it appears.

One significant cue we have about this issue is that this seemed to happen after we went through the installation process of a 5G modem on our jetson tx2.

This is the modem:

Its installation steps can be found here:

Assuming that this was the cause for these errors, here is the “install.sh” script:

#Update KO
uname_r=$(uname -r)
echo $(uname -a)

cd option
make
mv /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option_bk.ko
cp option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/
cd ..

cd qmi_wwan_simcom
make
cp qmi_wwan_simcom.ko /lib/modules/$(uname -r)/kernel/drivers/net/usb
cd ..

depmod
modprobe option
modprobe qmi_wwan_simcom
modprobe -r qmi_wwan_simcom
modprobe qmi_wwan_simcom
dmesg | grep "ttyUSB"
dmesg | grep "qmi_wwan_simcom"

#add DNS file
mkdir -p /usr/share/udhcpc
sudo chmod 777 default.script
sudo cp default.script /usr/share/udhcpc

and the output:

sudo ./install.sh 
Linux heifu-tx2-nx 4.9.253-tegra #1 SMP PREEMPT Fri Nov 5 15:14:33 WET 2021 aarch64 aarch64 aarch64 GNU/Linux
make -C /lib/modules/4.9.253-tegra/build -I ./usb_wwan SUBDIRS=/home/heifu/Sim8200_for_jetsonnano/option modules
make[1]: Entering directory '/usr/src/linux-headers-4.9.253-tegra-ubuntu18.04_aarch64/kernel-4.9'
  CC [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/option/option.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.9.253-tegra-ubuntu18.04_aarch64/kernel-4.9'
rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions Module.* modules.order
make -C /lib/modules/4.9.253-tegra/build M=/home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom modules
make[1]: Entering directory '/usr/src/linux-headers-4.9.253-tegra-ubuntu18.04_aarch64/kernel-4.9'
  CC [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.9.253-tegra-ubuntu18.04_aarch64/kernel-4.9'
modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
[    2.002333] usb 1-2.1: GSM modem (1-port) converter now attached to ttyUSB0
[    2.002618] usb 1-2.1: GSM modem (1-port) converter now attached to ttyUSB1
[    2.002916] usb 1-2.1: GSM modem (1-port) converter now attached to ttyUSB2
[    2.003208] usb 1-2.1: GSM modem (1-port) converter now attached to ttyUSB3
[    2.003500] usb 1-2.1: GSM modem (1-port) converter now attached to ttyUSB4

Is there something else that you would like us to try which may help us to better understand the issue?

Thank you for your help.

Hi again. (BTW @jbalao is my coworker).

We were able to determine that the errors mentioned in the first post started happening on Jetsons where we installed the 5G driver mentioned previously by jbalao.

Now, since the errors happen when loading /etc/profile, we ran some tests.

In the script /etc/profile we commented the loop that runs all the *.sh scripts inside profile.d
Because this error didn’t show up every time, we can’t conclude anything with absolute certainty but since we commented all these scripts, we booted up dozens of times and didn’t see the error anymore.

Now, assuming it’s one of the scripts inside profile.d that is throwing up the problem, we looked through them and tried only commenting the ones that looked like more trouble.

After some tests, and by tests I mean commenting scripts and booting up the jetson a whole lot of times, we are almost certain that the script /etc/profile.d/jetson_env.sh is giving that problem.

What we now have is the /etc/profile running all scripts normally, except the jetson_env.sh script (since we renamed it to not match the *.sh pattern). Since then we have not had any more problems booting up, and have not encountered any of the errors shown in the first post.

My question is then, what does that script do and how essential is it?

We understand that this is a temporary solution for a problem we need to look further into, since what causes this is the 5G driver and not the scripts in /etc/profile.d, but for now we want to know what that script in particular does, and any clues as to what might be happening.

Either way, it’s not exactly fixed, but we are not getting any of those errors anymore with jetson_env.sh disabled.
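
The comment-and-reboot bisection described above can be approximated from a terminal. The sketch below assumes the usual /etc/profile behavior of sourcing every readable *.sh file in profile.d; the function name is hypothetical. Since it runs after boot rather than during it, a timing-dependent error may not reproduce this way.

```shell
# Source each *.sh in a directory inside its own subshell and print the
# names of scripts that write to stderr or finish with a non-zero status.
noisy_profile_scripts() {
    dir=${1:-/etc/profile.d}
    for script in "$dir"/*.sh; do
        [ -r "$script" ] || continue
        # 2>&1 first captures stderr, then stdout is discarded; the inner
        # subshell keeps each script's environment changes isolated.
        err=$( ( . "$script" ) 2>&1 >/dev/null )
        status=$?
        if [ -n "$err" ] || [ "$status" -ne 0 ]; then
            echo "$script"
        fi
    done
}
```

Running “noisy_profile_scripts” with no argument checks /etc/profile.d; passing a directory checks a copy instead.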

This isn’t in order, I’m just making notes as I go.

  • Does the command line prompt lock up after the lsmod or “ls /dev/tegra_dc_*”?
  • What is shown from the command:
    grep video /etc/group
  • Is it possible to get a “dmesg” log which occurs just after the failure is noticed (admittedly this won’t be possible if there is a communications failure and it is in the air)?
  • Even though this is on a drone, is there a possibility of accessing the Jetson during the failure even if it is just for testing without the drone being in the air?
  • The kernel module “/lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko” is being replaced by a third party module. It is possible this is compiled against a different kernel release or is in some way not compatible with this hardware if “option.ko” is used for anything other than this one device. Does the manufacturer make this module available as source code, or is it purely binary?
  • Similar for module “qmi_wwan_simcom.ko”. Is this available as source code, or is it just binary?
  • Actually, I see this did compile against the kernel headers, and so this is in source form, and it was compiled for this kernel, so disregard anything about it maybe not being built against this kernel. However: the compile log says this is built for the wrong architecture. This is what the exec format error is about:
modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

It is as if this was cross compiled incorrectly on a host PC and not on the Jetson. Were the install steps performed on a PC in any part at all?

Note that if something in “/etc/profile” needs the qmi_wwan_simcom module loaded, then this would be a mandatory failure. I think knowing why this has exec format error is key to solving it.
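
One way to narrow down an exec format error from modprobe is to look at what the kernel logged at the moment the load was refused. The sketch below is an assumption-laden helper (the function name is made up): it searches saved kernel log text for two messages that the module loader is known to print when it rejects a module with this error code, namely a version magic mismatch and a symbol version disagreement. Other causes would not be caught by it.

```shell
# Print kernel log lines that usually accompany a rejected module load.
# $1: file containing kernel log text, for example saved "dmesg" output.
module_load_hints() {
    grep -E "version magic|disagrees about version of symbol" "$1"
}
```

A typical use would be “dmesg > after_modprobe.txt” right after the failing modprobe, followed by “module_load_hints after_modprobe.txt”. No output means neither message was logged.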

Hi.
To answer your questions I reverted the previously mentioned temporary solution:

So now we have the error popping up again. Now to answer your questions:

No. If by ‘lock’ you mean it gets stuck and unusable, it does not happen. We get the expected output. (I’m not sure if the output itself is relevant so I’ll omit it for brevity).

video:x:44:heifu,gdm,lightdm

While we are working on a drone, most, if not all, tests are done on bench, with just our motherboard connected to power supply and all relevant modules, you can assume the drone is not flying and we’re always able to access the jetson. (Even if it’s in the air, we have ways to do so). To answer the question, the “failure” is that popup during boot. The exact sequence of events is: jetson powerup → company logo → popup with error found when loading /etc/profile ... → nvidia logo → desktop.
So I can’t really give you the dmesg log just after the failure, but I’ll attach the full dmesg log anyway, in case it helps.
dmesg.txt (63.8 KB)

Something we didn’t mention that might be important: these drivers are for the Jetson Nano, while we’re using a TX2 NX. The source code is available here: wget https://www.waveshare.net/w/upload/0/07/Sim8200_for_jetsonnano.7z, which is linked from this url: https://www.waveshare.com/wiki/SIM8200EA-M2_5G_HAT# , down in Use with Jetson Nano.

We are not sure what difference it makes to use Jetson Nano drivers on a TX2 NX, but as you said, we did notice as well that it was compiled for this kernel, which is why we’re so confused about the Exec format error mentioned.

Can you explain this a little bit better for us? As in, which architecture was expected (if you can). Does this have to do with the fact it’s a TX2 NX instead of a Nano?

Also, just as a note, because we have the source code of the driver, we’re not afraid of looking through the code and trying to make the necessary changes to fix this problem, if it is possible at all. If you look through the driver source code, can you spot anything that might explain the exec format errors?

We will focus on trying to understand this, if you think it’s the right direction.

I guess this is already a rather long post, but one more question. Since we’re using drivers for the Nano, what exactly differs on the TX2 NX in terms of driver implementation? How far are we from changing the source implementation to, I guess, “fit” our environment?

I realize it’s a long post so, thank you for your patience and help.

Can you attach a copy of “/etc/profile.d/jetson_env.sh” here?

From what you say it seems this is an ordinary boot popup and there has not yet been any attempt of a user login. Can you verify this is correct? Is there any GUI issue when a user actually logs in? If this shows up from a user login (and it might not, I only know so far the popup hits during boot), is that user (the one logging in) listed in the output of “grep video '/etc/group'” (meaning user “heifu”)?

Incidentally, I think the permissions of the driver in “/dev/tegra_dc_*” is correct, and the issue is about either the account accessing this, or from the timing of access being too early (this is separate from the exec format error issue). Typically there are some GPU operations which require either using “sudo”, or alternatively, adding the user to group “video” (which would make the user show up in “/etc/group” under “video”; normal boot scripts, prior to a user login, are typically run as root).
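
The group check can be scripted against a line in /etc/group format. This is a sketch with a hypothetical function name; it only looks at the member list in the fourth field, so it says nothing about a user's primary group, and a session that was already running when a user was added will not have picked up the new group.

```shell
# Succeed if the user is listed in the member field of an /etc/group line.
# $1: one line in /etc/group format, $2: user name.
user_in_group_line() {
    members=$(printf '%s\n' "$1" | cut -d: -f4)
    # Wrap both sides in commas so that only whole names match.
    case ",$members," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example: user_in_group_line "$(grep '^video:' /etc/group)" heifu && echo "in video group".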

Note that the “exec format error” is a separate issue, though perhaps indirectly related. The dmesg log does show what is probably the particular moment when the popup occurs:

[    0.903753] tegradc 15200000.nvdisplay: hdmi: invalid prod list prod_list_hdmi_board
[    0.903756] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed

HDMI is “plug-n-play”, which means the system is able to probe the monitor and ask the monitor for its specs. This is done using the i2c protocol on the wire named “DDC”, and is known as the EDID data. The Jetson itself provides the power to the monitor’s i2c circuitry, which means the monitor can be queried even if it is turned off. The above mentioned error says there is something about the HDMI’s response which is not valid. Can you say more about whether this monitor is something “standard”, or if there is anything unusual about it? If you first go into a root shell via “sudo -s”, then what do you see from:

find /sys -name '*edid*'
egrep -i '*' `find /sys -iname 'edid'`

Can you attach a copy of “/etc/profile.d/jetson_env.sh” here?

Regarding the exec format error, this involves whatever device the qmi_wwan_simcom.ko module is for. Presumably this is the hardware which the “install.sh” script exists for, namely the 5G module. This module seemingly has no HDMI involvement, and it is not possible for “qmi_wwan_simcom.ko” to load or function if it is the wrong architecture, but it might have some indirect influence. I say this because in kernel space something wrong with one driver can possibly interfere with another driver even when they are not related. Your 5G device might even have more than one driver, but the “qmi_wwan_simcom.ko” driver has no possibility of working if it is the wrong architecture; if there is some other kernel space driver which needs “qmi_wwan_simcom.ko” to function, then possibly the attempt to use a non-existent driver could have some odd effect.

First though we should see about EDID information I mentioned in the previous paragraph. If we know the monitor is valid, then such a popup is likely unrelated and we can move on to the exec format error. Btw, can you try a different monitor and see if this error goes away in the following command?
dmesg | egrep -i '(invalid prod list|tegra_hdmi_tmds_range_read)'

And yes, mixing a Nano driver with a TX2 NX might cause problems, but they are both the same architecture (arm64/aarch64), and so I would not expect an exec format error (the module might fail to load, but this would not be the error unless the Nano driver is really a 32-bit compatibility mode ARMv7-a driver and was never recompiled for ARMv8-a).

32-bit is ARMv7-a, while 64-bit is ARMv8-a. The former is typically known as armhf, while the latter is typically known as arm64 or aarch64. A desktop PC might be known as amd64 or x86_64. The binary code for one is not compatible on the other and they will refuse to run. If you compile for a PC natively on the PC, then you get what you could call amd64/x86_64. If you compile on a TX2, natively, then you would get arm64/aarch64. If you compile natively on an older 32-bit Jetson TK1, then you would get armhf. Trying to run a different type of executable for a different architecture on the TX2 causes an exec format error.
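
Which architecture a given file was built for is recorded in its ELF header, so it can be read without trying to load it. The sketch below is a rough stand-in for “file” or “readelf -h”; the function name is hypothetical, it assumes a little-endian ELF file, and it only knows the three machine codes relevant to this discussion (183 for AArch64, 40 for 32-bit ARM, 62 for x86-64).

```shell
# Print the target architecture recorded in an ELF file's header.
elf_machine() (
    f=$1
    # Bytes 0-3 of any ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$f" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        exit 1
    fi
    # e_machine is the 16-bit field at offset 18 (little-endian assumed).
    set -- $(od -An -tu1 -j18 -N2 "$f")
    code=$(( $1 + 256 * $2 ))
    case $code in
        183) echo "aarch64" ;;
        40)  echo "arm" ;;
        62)  echo "x86_64" ;;
        *)   echo "unknown($code)" ;;
    esac
)
```

On the Jetson, something like “elf_machine /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko” would show what the installed module actually is.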

Often people will go look for drivers on the internet for some device. Many downloads assume a desktop PC and provide amd64/x86_64. If you put that on a Jetson and try to load it, then it says “exec format error”. Alternatively, if you compiled directly on the TX2, but the architecture were specified instead of allowing default native tools, then you would also get exec format error. If you compiled on a desktop PC, but used cross tools designed to output arm64, then it should work.

Incidentally, because NVIDIA documents always give instructions for building kernel modules as a cross compile, whereby the cross tools run on amd64/x86_64, but produce modules which are for arm64, people tend to mix those up with the correct instructions when compiling natively on the Jetson. If you cross compile from a PC you would specify “ARCH=arm64”, but if you natively compile on a TX2 and say “ARCH=arm64”, then you will get a surprising result that running on the Jetson claims it is the wrong architecture. The Jetson is arm64, but the native tools do the right thing when you completely leave out “ARCH=arm64” (I don’t know why that is).
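
To make the native versus cross distinction concrete, here is a sketch that prints the make invocation suggested by the advice above. The function name is made up, “M=” is the standard variable for an external module directory, and the “aarch64-linux-gnu-” toolchain prefix is an assumption; the actual prefix depends on which cross toolchain is installed.

```shell
# Print a module build command for the given build host.
# $1: build host machine (as printed by "uname -m")
# $2: kernel build directory, $3: module source directory
module_make_cmd() {
    if [ "$1" = "aarch64" ]; then
        # Native build on the Jetson: no ARCH, no CROSS_COMPILE.
        echo "make -C $2 M=$3 modules"
    else
        # Cross build from a PC: name the target and the toolchain prefix.
        echo "make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -C $2 M=$3 modules"
    fi
}
```

For example: module_make_cmd "$(uname -m)" "/lib/modules/$(uname -r)/build" "$PWD".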

Something regarding the install.sh script is using invalid compile options. Both module “option.ko” and “qmi_wwan_simcom.ko” were built incorrectly (not because they failed to build, but instead because they were built for the wrong architecture):

modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

Hi.

jetson_env.sh (1.3 KB)

This is a little hard to verify. For our purposes, we want to boot as fast as possible and without any user interaction, so we disabled login prompt (there’s no GUI for user login, goes straight to desktop). I can only assure you (as you already asserted) that it happens during boot. The user logs in automatically, and in the output of grep video '/etc/group' it’s indeed the user “heifu”.

I think I understand your second paragraph. Can it still be a timing issue if the user “heifu” already belongs to the video group? Which I’ve confirmed it does belong there.

Not sure what to say, it’s a pretty standard monitor I guess. Sorry I can’t be much help there. The commands and their outputs are below:

/sys/kernel/debug/tegradc.1/edid
/sys/kernel/debug/tegradc.0/edid
/sys/bus/i2c/drivers/tegra_edid
/sys/module/drm/parameters/edid_fixup
/sys/kernel/debug/tegradc.1/edid:No EDID
/sys/kernel/debug/tegradc.0/edid: 00 ff ff ff ff ff ff 00 1e 6d 55 5b 01 01 01 01
/sys/kernel/debug/tegradc.0/edid: 01 1a 01 03 80 30 1b 78 ea 31 35 a5 55 4e a1 26
/sys/kernel/debug/tegradc.0/edid: 0c 50 54 a5 4b 00 71 4f 81 80 95 00 b3 00 a9 c0
/sys/kernel/debug/tegradc.0/edid: 81 00 81 c0 90 40 02 3a 80 18 71 38 2d 40 58 2c
/sys/kernel/debug/tegradc.0/edid: 45 00 e0 0e 11 00 00 1e 00 00 00 fd 00 38 4b 1e
/sys/kernel/debug/tegradc.0/edid: 55 12 00 0a 20 20 20 20 20 20 00 00 00 fc 00 4c
/sys/kernel/debug/tegradc.0/edid: 47 20 46 55 4c 4c 20 48 44 0a 20 20 00 00 00 ff
/sys/kernel/debug/tegradc.0/edid: 00 0a 20 20 20 20 20 20 20 20 20 20 20 20 01 61
/sys/kernel/debug/tegradc.0/edid: 02 03 1b f1 48 90 04 03 01 12 1f 10 13 23 09 07
/sys/kernel/debug/tegradc.0/edid: 07 83 01 00 00 65 03 0c 00 10 00 02 3a 80 18 71
/sys/kernel/debug/tegradc.0/edid: 38 2d 40 58 2c 45 00 e0 0e 11 00 00 1e 2a 44 80
/sys/kernel/debug/tegradc.0/edid: a0 70 38 27 40 30 20 35 00 e0 0e 11 00 00 1e 01
/sys/kernel/debug/tegradc.0/edid: 1d 00 72 51 d0 1e 20 6e 28 55 00 e0 0e 11 00 00
/sys/kernel/debug/tegradc.0/edid: 1e 8c 0a d0 8a 20 e0 2d 10 10 3e 96 00 e0 0e 11
/sys/kernel/debug/tegradc.0/edid: 00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00
/sys/kernel/debug/tegradc.0/edid: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4b

I’m not sure how to interpret any of this, I’ll leave that to you.

Attached at the beginning of the post.

I tested on different monitors, though they are all the same model and same specs (it’s the only company monitors available), and every time we had the popup during boot, that command yielded:

dmesg | egrep -i '(invalid prof list | tegra_hdmi_tmds_range_read)'
[    0.860814] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed
[    0.964339] tegradc 15210000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed

This is very helpful. So, the exec format error should trigger from an executable for a different architecture. I’m not sure if we installed the driver directly (rather than compiling it first and then installing), but I can try to compile the whole module directly on the TX2 and then install it, hopefully ensuring it’s in the right architecture.

Once again, very helpful to know. That is very interesting.

So adding to what I said before, I will look through the install script and see what it is doing exactly. I’m going to look for a cross compile flag, disable it and compile it with native tools on the TX2 NX.

I guess my next move is to try to fix the Exec format error, as you’ve hinted before.

That was very insightful and a lot of good information. Thank you for your time. I answered some of the questions you left and will update once I have made any progress regarding the Exec format error.

Francisco

The script “jetson_env.sh” is from JTOP. I am going to guess that if you remove JTOP, then the popup will go away. I don’t know the exact reason it is giving a permission denied error, but it wouldn’t be unusual for something like JTOP to monitor a file in “/sys” or “/proc”, and some of those change depending on kernel version or configuration…possibly it just needs some sort of update to deal with such a change.

For reference, your EDID from the monitor was valid. You can explore what it reports by pasting into http://www.edidreader.com/:

00 ff ff ff ff ff ff 00 1e 6d 55 5b 01 01 01 01
01 1a 01 03 80 30 1b 78 ea 31 35 a5 55 4e a1 26
0c 50 54 a5 4b 00 71 4f 81 80 95 00 b3 00 a9 c0
81 00 81 c0 90 40 02 3a 80 18 71 38 2d 40 58 2c
45 00 e0 0e 11 00 00 1e 00 00 00 fd 00 38 4b 1e
55 12 00 0a 20 20 20 20 20 20 00 00 00 fc 00 4c
47 20 46 55 4c 4c 20 48 44 0a 20 20 00 00 00 ff
00 0a 20 20 20 20 20 20 20 20 20 20 20 20 01 61
02 03 1b f1 48 90 04 03 01 12 1f 10 13 23 09 07
07 83 01 00 00 65 03 0c 00 10 00 02 3a 80 18 71
38 2d 40 58 2c 45 00 e0 0e 11 00 00 1e 2a 44 80
a0 70 38 27 40 30 20 35 00 e0 0e 11 00 00 1e 01
1d 00 72 51 d0 1e 20 6e 28 55 00 e0 0e 11 00 00
1e 8c 0a d0 8a 20 e0 2d 10 10 3e 96 00 e0 0e 11
00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4b

Because EDID is valid I don’t know what the earlier issue was from this error:

[    0.903753] tegradc 15200000.nvdisplay: hdmi: invalid prod list prod_list_hdmi_board
[    0.903756] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed

It is possible that this is simply a monitor which has something unexpected. Not sure if it is an issue or not. If the monitor works, then I guess just ignore this. If not, then that might be a new thread. On the other hand we can verify that it isn’t the monitor causing the popup.
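
The basic validity of an EDID dump like the one above can also be checked locally. The sketch below (hypothetical function name) tests one 128-byte base block for the fixed 8-byte header and for the rule that all 128 bytes sum to zero modulo 256. It does not decode anything, and extension blocks, which start with a different tag, would need the header test removed.

```shell
# Succeed if a 128-byte EDID base block has a valid header and checksum.
# $1: the block as 128 space-separated hex bytes.
edid_block_ok() (
    # Lower-case the hex digits and split the string into one word per byte.
    set -- $(printf '%s' "$1" | tr 'A-F' 'a-f')
    [ "$#" -eq 128 ] || exit 1
    [ "$1 $2 $3 $4 $5 $6 $7 $8" = "00 ff ff ff ff ff ff 00" ] || exit 1
    sum=0
    for byte in "$@"; do
        sum=$(( (sum + 0x$byte) % 256 ))
    done
    # The last byte is chosen by the monitor so that the total wraps to zero.
    [ "$sum" -eq 0 ]
)
```

The argument would be the first eight rows of the dump joined into one string, with the “/sys/kernel/debug/tegradc.0/edid:” prefixes removed.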

After removing JTOP see if the popup goes away. Perhaps there is an updated version specifically for the TX2 if you need JTOP.

Hi.

I still haven’t reached any conclusion on the Exec format error. Among other work related things, I’ve been trying to understand why and how I can compile it in the right architecture, so I don’t really have an update on that.

I’m not sure if the problem is related to ‘cross compiling’ vs ‘using native tools’ because this script looks like it was made to run directly on a Jetson, as it both compiles and installs the drivers (because it also installs the drivers, I’m assuming it is intended to be executed directly on the Jetson). If this was the problem somehow, I feel like this would’ve been a major oversight that would’ve already been fixed. Though, again, this is a TX2 NX and these are meant for NANO, so I’m not sure how the behavior would differ there.

Although, an interesting find is, after completely removing JTOP from our jetson, neither of those errors mentioned in the first post are happening, and surprisingly enough, where there’s coverage, we can connect to 5G network. And the drivers seem to be installed (?) since I can see them in their respective dirs that are present in the install.sh script provided.

They do make it clear in the documentation that the system kernel should be 4.9.140-tegra, while ours is 4.9.253. Since we’re on the same major version I didn’t think much of it, but it could be a problem as well.

I still feel like there’s a problem because of the Exec format error messages, despite the 5G module actually working, and uninstalling JTOP might’ve just suppressed some superficial error. Not sure if I’m correct or not, but I’ll keep updating as I go.


I am at a point where I don’t understand why the drivers option and qmi_wwan_simcom give Exec format error.
When I check the info for each driver, it looks like (to me) it’s the same version, so for example:
Output of uname -a:
Linux heifu-tx2-nx 4.9.253-tegra #1 SMP PREEMPT Fri Nov 5 15:14:33 WET 2021 aarch64 aarch64 aarch64 GNU/Linux

Output of modinfo qmi_wwan_simcom

version:        Simcom_Linux_QMI_WWAN_Driver_V1.0
license:        GPL
description:    Qualcomm MSM Interface (QMI) WWAN driver
author:         Bjørn Mork <bjorn@mork.no>
srcversion:     29480A58515EB2DB048A6C6
alias:          usb:v1E0Ep9001d*dc*dsc*dp*ic*isc*ip*in05*
depends:        cdc-wdm
vermagic:       4.9.253-tegra SMP preempt mod_unload modversions aarch64

Output of modinfo option

filename:       /lib/modules/4.9.253-tegra/kernel/drivers/usb/serial/option.ko
license:        GPL
description:    USB Driver for GSM modems
author:         Matthias Urlichs <smurf@smurf.noris.de>
alias: ...
( ... )
alias: ....
depends:        usb_wwan
vermagic:       4.9.253-tegra SMP preempt mod_unload modversions aarch64

So, unless I’m interpreting this wrong, it feels like it’s being compiled against the same architecture? Not sure if I can infer that from this though, might need some more in depth information on that.
But pretty much I’m out of ideas on why it gives that error for each driver.
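
The by-eye comparison of “vermagic” against “uname -a” can be scripted as a partial check. The sketch below (hypothetical function name) only compares the first field of the vermagic string with the kernel release. A match there does not prove the module will load: with “modversions” in the vermagic, symbol checksums taken from the headers used for the build are compared as well, and those can differ between a custom compiled kernel and a stock headers package.

```shell
# Succeed if the kernel release embedded in a vermagic string equals the
# given release. $1: vermagic string, $2: kernel release.
vermagic_matches_release() {
    # The release is everything before the first space in vermagic.
    [ "${1%% *}" = "$2" ]
}
```

For example: vermagic_matches_release "$(modinfo -F vermagic option)" "$(uname -r)" && echo "release matches".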

If this is native compile, then specifying ARCH=arm64 would seem to be harmless since the ARCH is already arm64, but this is a subtle failure. I don’t know why, but if you do specify ARCH, despite this being the same as the native system, this tends to cause the native system to call this a foreign architecture (it probably shouldn’t, but this is my experience). Make sure on native compiles that you never use “ARCH=arm64” since this can have a different result than not specifying “ARCH=arm64” on an arm64 system. Having an exec format error is an absolute guarantee the file cannot be used without an emulator or other special means of execution. Your system thinks the file is cross compiled for a different architecture than what is native. Perhaps it is related to thinking different loading/linking tools are required, but that is just guessing.

Much will depend on the actual build steps for the offending “exec format error” in combination with the environment it was built in.

Using the wrong kernel can break things. Not nearly enough is known to say if substituting 4.9.253 instead of 4.9.140 will matter. If software is compiled to work with the other kernel, then it won’t be a problem. Or if the changes from 4.9.140 to 4.9.253 left an intact API/ABI, then this too won’t matter.

I have to emphasize that exec format error is an absolute guarantee of failure. The CPU thinks this cannot run on it. It won’t even try. This is entirely a build issue; either native build specified it such that it thinks it is foreign, or else it was actually a foreign build without cross tools to make it into the native architecture. Should this be a build issue on a native environment, then it is likely a simple fix related to removing something which was originally for foreign cross build. “uname -a” tells you about the local system, and whatever that file is, the system does not think it is compatible. “option” apparently depends on “usb_wwan”, but this cannot be loaded, thus “option” fails.

Just to emphasize, it won’t matter if this is the right architecture if some detail in build treats this as if it is foreign and causes native to believe this is foreign.


Hi.

I won’t address everything you said, but know that I think I understood most, if not all, of what you are saying and I’m keeping that in mind as I go. There was a lot of misunderstanding on my part on how these things worked, so thank you very much for the in depth clarifications. Sorry if we’re being redundant at times, but it’s being really valuable to me.

Now, you obviously highlighted this:

I understand what you mean, as in, if you compile in the native system, but still cross compile to the same system, it will recognize it as being another architecture, despite being compiled “for itself” essentially.

I’ve been going through the Makefiles for each driver that we mentioned previously, looking for a cross compilation flag. One thing I didn’t know is that modules are compiled with a makefile that exists within /lib/modules/$(uname -r)/build. And it’s interesting because I always found it weird that the makefile provided for the individual drivers didn’t call a compiler like gcc, but rather another make, which didn’t make much sense until I noticed that it changes directories first and then makes the module inside /lib/modules/$(uname -r)/build.

So let’s say, this makefile is provided by waveshare to compile the option driver (I omitted the clean target for brevity):

obj-m:=option.o
optionmodule-objs:=module
KDIR:=/lib/modules/$(shell uname -r)/build
MAKE:=make
default:
	$(MAKE) -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) modules

So this will actually build with the makefile in /lib/modules/$(uname -r)/build/Makefile. That has a lot of makefile code that I can skim through but can’t say I fully understand (as it is very long as well), but I do see some references of cross compilation support. Because there’s a lot of code, I can’t really understand if any cross compilation flag is being set within the makefile, even without being clearly specified in the make command.

One thing that I’m assuming throughout this conversation is that the “exec format error” comes specifically because it was built correctly but the native system recognizes it as a foreign architecture, whether or not it was compiled with native tools, if ARCH=arm64 was given, for example. And because of this, quoting you: “The CPU thinks this cannot run on it. It won’t even try.”

Is this always the case for “exec format error” or can there be anything else you might think of?

This question is because I’m currently just trying to find a way, in here /lib/modules/$(uname -r)/build/Makefile where it is somehow cross compiling the drivers, despite me not finding any apparent flags like ARCH=arm64. I do see some references to an $ARCH variable that is assigned through this command:

SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
				  -e s/sun4u/sparc64/ \
				  -e s/arm.*/arm/ -e s/sa110/arm/ \
				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )

Which in my case would output arm64. So if that variable makes its way into the actual compilation of the module, it would always be “cross compiling” the driver (?). Not really sure if what I’m saying is correct, but it’s what I’m interpreting.
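To make that concrete, here is a small sketch that runs the same sed substitutions on a few machine names instead of on `uname -m`. The function name `subarch` is only for illustration. As far as I can tell from the top-level Makefile, ARCH only falls back to this value when it was not set by the caller (`ARCH ?= $(SUBARCH)`), so on the Jetson it would just select the native arm64 tree, which is not by itself cross compiling.

```shell
#!/bin/sh
# Same substitutions as the SUBARCH line quoted above, applied to a
# machine name passed as an argument instead of the output of `uname -m`.
# "subarch" is a hypothetical helper name, not part of the kernel build.
subarch() {
    echo "$1" | sed -e 's/i.86/x86/' -e 's/x86_64/x86/' \
                    -e 's/sun4u/sparc64/' \
                    -e 's/arm.*/arm/' -e 's/sa110/arm/' \
                    -e 's/s390x/s390/' -e 's/parisc64/parisc/' \
                    -e 's/ppc.*/powerpc/' -e 's/mips.*/mips/' \
                    -e 's/sh[234].*/sh/' -e 's/aarch64.*/arm64/'
}

subarch aarch64          # the Jetson reports aarch64
subarch x86_64           # a typical host PC
subarch "$(uname -m)"    # whatever machine this is run on
```

The first two calls print arm64 and x86, which are the directory names under arch/ in the kernel source.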

Though, if this observation of yours (that cross compiling for the same architecture results in a “foreign architecture”) is correct, which I’m assuming it is, I’m not sure why that Makefile would cross compile by default, without any flags specified, so I’m not sure if my, let’s call it theory, that it always cross compiles is correct.
Not sure if what I said about “cross compiling by default” made sense, but if I didn’t know that cross compiling with the same native tools would result in a foreign architecture, I personally would think it would be a reasonable thing to do, to ensure a specific architecture. So just take this paragraph as my observations on the matter and what is going through my head, nothing that I’ve specifically observed, just kind of thinking out loud.


So I guess, to finish with some actual questions:

  • Is exec format error specific to a foreign architecture, or can it be triggered by a different situation?
  • Could the actual source implementation of the driver, somehow, specify a different architecture such that, once compiled natively, it’s still meant for a specific system, not necessarily the one it was compiled on?
  • Can I, somehow, ensure that a Makefile doesn’t cross compile? (I’ve tried to unset the ARCH variable within the Makefile, but it would probably take a much deeper solution than doing just that.)

I’d greatly appreciate any information, even if it is only indirectly related but still useful for this discussion.

Thank you very much for your help so far.

This sounds like what would happen if you were building out-of-tree modules. Normally, if you have the full source tree, then that Makefile would not be consulted. Do you have any third party code copied in, without using the downloaded full kernel source from the NVIDIA website? It sounds like this is what the waveshare is. If so, then you might need to set up the symbolic links in “/lib/modules/$(uname -r)” to point at your manual content which is from the NVIDIA download and configured to match the running kernel (which would cause it to use the right Makefile and NVIDIA content). What you wouldn’t want to do is edit this manually since it is (A) easier to just set up NVIDIA source, and (B) there are so many edits.

This is correct:

Yes, this is always the case. “exec format error” implies the CPU won’t use the code of another completely different processor. The instruction sets are not the same. It is just binary noise so far as that CPU is concerned.

Basically, check that there are symbolic links in “/lib/modules/$(uname -r)/”. See where they point. This is what packages for building modules against a system normally look at when they don’t have full kernel source…this is out-of-tree build infrastructure. Now imagine that we unpack the full kernel source (if no .deb package is available which matches the existing system, and NVIDIA is customized, but I’m not sure if such a header-only package is available for this…in the past it was not, and apparently what is there now is not customized for this NVIDIA release). Now assume that source exists at “/usr/src/sources/kernel/kernel-4.9”. Imagine that within this source you’ve run the correct “sudo make tegra_defconfig” (I use sudo because this should be owned only by root, readable by others). Then imagine you have set the generated .config variable CONFIG_LOCALVERSION to “-tegra”. This source could be used to build against the running kernel.

Now take this a step further, and change the symbolic links in “/lib/modules/$(uname -r)” to point to the equivalents in “/usr/src/sources/kernel/kernel-4.9”. Your outside module should build correctly against this.

I don’t know what is used to check architecture before load. I am only guessing that it has some “magic” written somewhere, and specifying ARCH from the build. Perhaps, since it was not expected to use ARCH in combination with cross tools, that this “magic” was never written. Don’t know, but I have had this cause this failure in the past, so I know something is not right. Maybe native tools don’t know to write this information when used as a cross compiler since most of the time it would be redundant.
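For what it is worth, one piece of that “magic” can be inspected from user space: each module carries a “vermagic” string, and `modinfo -F vermagic` prints it. As far as I know, the module loader answers with the same error number (ENOEXEC, which is what shows up as “exec format error”) when this string does not match the running kernel, so it may be worth ruling out. A minimal sketch, where the helper names and the example path are my own and not from any NVIDIA tool:

```shell
#!/bin/sh
# Print the kernel release a module was built for: the first word of its
# vermagic string, e.g. "4.9.253-tegra" from "4.9.253-tegra SMP preempt ...".
# Both helper names are hypothetical, chosen for this example only.
vermagic_release() {
    echo "$1" | awk '{print $1}'
}

# Succeed only when the module's release equals the given running release.
vermagic_matches() {
    [ "$(vermagic_release "$1")" = "$2" ]
}

# On the Jetson (the module path is an example):
#   vermagic_matches "$(modinfo -F vermagic ./option.ko)" "$(uname -r)" \
#       && echo match || echo mismatch
#   dmesg | tail    # the kernel log normally says why a load was rejected
```

This only compares the release part of the string; the kernel log after a failed modprobe is the more reliable place to see the actual reason.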

My suggestion is to first make sure you have enough disk space. Then install source at “/usr/src/kernel/kernel-4.9” (or similar). Configure this as if you are going to build everything there, including CONFIG_LOCALVERSION. Change those symbolic links to point there. Try building your out-of-tree content again against this.
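A rough sketch of the symbolic link step, assuming the source tree location used earlier in this thread. The helper name `point_build_at` is made up for this example, and the commented commands would need to be run as root on the Jetson:

```shell
#!/bin/sh
# Repoint the "build" symlink of a module directory at a kernel source tree.
# $1 = module directory (e.g. /lib/modules/$(uname -r)), $2 = source tree.
# "point_build_at" is a hypothetical helper name.
point_build_at() {
    [ -d "$2" ] || { echo "no such source tree: $2" >&2; return 1; }
    ln -sfn "$2" "$1/build"     # -n replaces the old link instead of following it
    readlink "$1/build"         # print where the link points now
}

# On the Jetson, as root (paths are the ones assumed in this thread):
#   point_build_at "/lib/modules/$(uname -r)" /usr/src/sources/kernel/kernel-4.9
#   cd /usr/src/sources/kernel/kernel-4.9
#   make tegra_defconfig        # then set CONFIG_LOCALVERSION="-tegra" in .config
#   make modules_prepare
```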

Hi.

I’ve been trying to install them by compiling using two different methods, which I will explain further. Both compile the modules successfully but still yield exec format error when attempting to add them.

First is what you’ve suggested:

I was able to find a thread where you gave a more detailed explanation on how to do this (errror executing modules prepare)

So, I’ve installed the sources in /usr/sr via ./source_sync.sh -k tegra-l4t-r32.1. Now I have the full kernel source at /usr/src/sources/kernel/kernel-4.9.
In /lib/modules/$(uname -r) I changed where build points to, to point to the full kernel source. Looks something like this:

heifu@heifu-tx2-nx:/lib/modules/4.9.253-tegra 
$ ls -l
total 1292
lrwxrwxrwx 1 root root     34 Oct  3 11:01 build -> /usr/src/sources/kernel/kernel-4.9
(...)  

Now, inside build (which is pointing to the full kernel sources), I did:

sudo make mrproper
sudo make tegra_defconfig 

I’m not sure the correct way of setting the LOCALVERSION, so I tried two different ways, one was simply export LOCALVERSION=-tegra, and the other way was to manually edit the generated .config file and change CONFIG_LOCALVERSION to CONFIG_LOCALVERSION="-tegra".
Either way, after setting CONFIG_LOCALVERSION to -tegra, I did sudo make modules_prepare. No problems until here.
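For reference, the manual edit amounts to something like the sketch below. `set_localversion` is just a helper name I made up, and it assumes GNU sed (for `-i`). I believe the kernel tree also ships a scripts/config tool that can make the same change.

```shell
#!/bin/sh
# Set CONFIG_LOCALVERSION in a kernel .config file.
# $1 = path to .config, $2 = suffix such as -tegra
# "set_localversion" is a hypothetical helper; assumes GNU sed.
set_localversion() {
    if grep -q '^CONFIG_LOCALVERSION=' "$1"; then
        # Replace the existing value (does not touch CONFIG_LOCALVERSION_AUTO).
        sed -i "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"$2\"|" "$1"
    else
        echo "CONFIG_LOCALVERSION=\"$2\"" >> "$1"
    fi
    grep '^CONFIG_LOCALVERSION=' "$1"   # show the resulting line
}

# Example, inside the kernel source directory:
#   set_localversion .config -tegra
```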
Now I run this very straightforward install script (also provided by Waveshare, but with slight changes to just make and install both modules):

cd option
make
mv /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option_bk.ko
cp option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/
cd ..

cd qmi_wwan_simcom
make
cp qmi_wwan_simcom.ko /lib/modules/$(uname -r)/kernel/drivers/net/usb
cd ..

depmod
modprobe option
modprobe qmi_wwan_simcom
modprobe -r qmi_wwan_simcom
modprobe qmi_wwan_simcom

Which outputs (hopefully it’s not too big of a text dump):

make -C /lib/modules/4.9.253-tegra/build -I ./usb_wwan SUBDIRS=/home/heifu/Sim8200_for_jetsonnano/option modules
make[1]: Entering directory '/usr/src/sources/kernel/kernel-4.9'

  WARNING: Symbol version dump ./Module.symvers
           is missing; modules will have no dependencies and modversions.

  CC [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/option/option.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.ko
make[1]: Leaving directory '/usr/src/sources/kernel/kernel-4.9'
rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions Module.* modules.order
make -C /lib/modules/4.9.253-tegra/build M=/home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom modules
make[1]: Entering directory '/usr/src/sources/kernel/kernel-4.9'

  WARNING: Symbol version dump ./Module.symvers
           is missing; modules will have no dependencies and modversions.

  CC [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.ko
make[1]: Leaving directory '/usr/src/sources/kernel/kernel-4.9'
modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

There’s a warning for each module build, not sure how important or relevant it is for the process.
It is indeed building with the full sources, as far as I can tell, and also I am not (intentionally) cross compiling it anywhere. I’m really not sure at this point if I setup something wrong or if there’s another problem.
If, for example, I run the file command on one of the modules I get:

file option.ko 
option.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=dd94366b76bb65de18c801b075540f05d4f5d52d, with debug_info, not stripped

Which, unless I’m overlooking something, tells me it is compiled for the correct architecture, which would imply that ARCH=arm64 was specified at some point for this not to work.
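The architecture that file reports comes from the e_machine field of the ELF header, which can also be read directly. A small sketch (the function name `elf_machine` is mine); 183 is the ELF machine number for AArch64 and 62 is the one for x86-64:

```shell
#!/bin/sh
# Print the ELF e_machine value of a file: bytes 18 and 19 of the header,
# stored low byte first in an LSB object such as the one shown above.
# 183 = EM_AARCH64, 62 = EM_X86_64. "elf_machine" is a hypothetical helper.
elf_machine() {
    # od prints the two bytes as unsigned decimals; reuse $1 and $2 for them.
    set -- $(od -An -tu1 -j18 -N2 "$1")
    echo $(( $1 + $2 * 256 ))
}

# Example (the path is an assumption):
#   elf_machine ./option.ko
```

This only confirms the instruction set of the object file. It says nothing about whether the running kernel will accept the module for other reasons.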
I’m really not sure how to go on from here.

For the second method

Here I tried to actually cross compile the modules, but now on my actual work machine, which is Ubuntu 20.04, x86_64. My idea was, if it was somehow cross compiling inside the jetson with native tools, let’s try to actually cross compile it from a different architecture.
So I basically did the same: I set up the environment for cross compiling (which I’ve done before for compiling the actual kernel sources), set up a directory with the full kernel sources, and ran the same setup and the same commands (only now I’m explicitly providing the env variables CROSS_COMPILE=/usr/bin/aarch64-linux-gnu- ARCH=arm64 LOCALVERSION=-tegra).
It compiles with some warnings (Not sure how important they are, i’ll just attach the compile logs):
crosscompile_logs.txt (5.1 KB)
and I confirmed the module is indeed for aarch64

file option.ko 
option.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=94c395ed44a856871b01bf3d7e71adcdce1340e0, with debug_info, not stripped

So I moved these modules, cross compiled in my x86_64 machine, to the jetson, tried to install them but still I get the exec format error.
I’m not sure if this would work (though in my mind it should, since now, while I’m cross compiling, i’m not doing it with native tools so, from what I understood from you, should’ve been fine) but still couldn’t do it.

I insisted more on the first method, since it’s what you’ve actually recommended but I can’t get it to work. Not sure if you can spot something I did wrong or might’ve missed but either way, I think I still need some help here.

Hopefully this makes sense, I’m not very knowledgeable here but I think I’ve been understanding your points so far, but I’m still not able to get this to work.

Thank you so much for your help so far.
Francisco.

This is not in a particular order, I’m just adding notes as I read your reply. Each reply note might be before reading all of your reply.

Beware that although source_sync.sh is probably ok for a TX2 that there are some releases which might need to be downloaded directly from the web site for the specific L4T release. Basically though I think this should be ok for your case, along with the change of the symbolic link.

The mrproper and tegra_defconfig inside the full sources is correct. However, for modules to see this, you would also need to provide a “modules_prepare” step if no kernel Image is built first. Before you do that be sure to set CONFIG_LOCALVERSION. I happen to know this has no dependencies, and so it is ok to directly edit the produced “.config” file, or to use a config editor (e.g., menuconfig or nconfig). If you go to the “$TOP” (in your case “/usr/src/sources/kernel/kernel-4.9”), then you can edit the file produced during the “tegra_defconfig” step (perhaps the “/proc/config.gz”, after decompression and renaming to “.config” would be a better choice, but initially this should be no different than “tegra_defconfig”) to have this:
CONFIG_LOCALVERSION="-tegra"

One could also use the command line during the make, although I have not personally used this. I’m not positive, but I think this is the equivalent:
make LOCALVERSION=-tegra ...
(which is equivalent to what you suggested for “export LOCALVERSION=-tegra”)
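Either way, the result can be checked before building anything: “make kernelrelease” is a standard target which prints the release string the configured tree would produce, and this should be identical to “uname -r”. A sketch, where `compare_release` is a made-up helper name. I believe that a tree checked out with git (which is what source_sync.sh produces) can get a “+” appended by scripts/setlocalversion when LOCALVERSION is not passed to make, so it is worth looking at the exact string. Note also that a variable exported in the shell is typically not passed through sudo unless “sudo -E” is used.

```shell
#!/bin/sh
# Compare the release string a source tree would build with the running one.
# $1 = output of "make -s kernelrelease", $2 = output of "uname -r"
# "compare_release" is a hypothetical helper name.
compare_release() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    elif [ "$1" = "$2+" ]; then
        echo "mismatch: tree adds a trailing '+' ($1 vs $2)"
        return 1
    else
        echo "mismatch: $1 vs $2"
        return 1
    fi
}

# On the Jetson, with $TOP set to the configured source directory:
#   compare_release "$(make -s -C "$TOP" kernelrelease)" "$(uname -r)"
```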

Once this directory is set up with complete configuration and modules_prepare, you would then want to always use the “O=/some/where” alternate intermediate build location (and you’d still configure at the alternate location, but this would not modify “$TOP”) and not build with sudo. Mainly the configuration which the symbolic link “build” points to is for externally compiled modules which reference the running system’s configuration. If you were to build a full kernel or all modules, then the “O=/some/where” would still have its own independent configuration.

Does this waveshare script have its own kernel module source? If this script is building a standard module, versus building an out-of-tree module which is not in the original kernel source, it changes what is needed. If this were simply building in-tree (something existing as part of the default kernel source), then it wouldn’t need any special script and I’d suspect the script is doing something to cause the false “exec format error” issue.

I’m thinking I saw that this waveshare source wants to replace the stock option.ko with its own version, in which case it would be valid to provide their own source and build it against the source which is mostly configured to match your running system. I say “mostly” because it might be necessary to do something odd to remove the existing “option.ko” module spec and instead build the external source using the same file name. This is where it gets confusing and I’m not sure what is going on in this step, but moving option.ko to a backup (option_bk.ko) says this is what is likely happening.

If you were to try to use a module which does not have a correct CONFIG_LOCALVERSION, then the error would have been different than exec format error. Probably whatever is wrong has nothing to do with CONFIG_LOCALVERSION.

The warning about missing “Module.symvers” might not matter, but the combination of “modules_prepare” and building the actual modules should generate this. You might try building actual modules with sudo in “$TOP” to get a Module.symvers if this is an issue. Then build your out-of-tree module again (but like I mentioned, I don’t know if this is needed…an error from this is likely different than “exec format error”).

The “file option.ko” says the format is correct, so something in the metadata of the build is at issue. Perhaps it is the missing Module.symvers file, and building modules in “$TOP” would help…not sure, but it is easy to find out. Just make sure the option.ko you are inserting is not the one you compiled earlier with the exec format error.

As long as the compiler version is correct it shouldn’t matter if you cross compile from an Ubuntu 18.04 host PC versus 20.04. It is good to test this. I see another warning about missing Module.symvers, and so perhaps this is important.

On the host PC, what do you see from:
/usr/bin/aarch64-linux-gnu-gcc --version
(and did you install the cross compiler from the NVIDIA web site which goes with your L4T release?)
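To compare this with the compiler which built the running kernel, the version recorded in “/proc/version” on the Jetson can be pulled out with something like the sketch below. The helper name `gcc_version_in` is made up, and whether a compiler mismatch actually matters for this error is not something I know.

```shell
#!/bin/sh
# Extract "X.Y.Z" from a string containing "gcc version X.Y.Z".
# "gcc_version_in" is a hypothetical helper name.
gcc_version_in() {
    echo "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# On the Jetson:
#   gcc_version_in "$(cat /proc/version)"
# On the host PC, the last field of the first line of the --version output:
#   /usr/bin/aarch64-linux-gnu-gcc --version | head -n 1 | awk '{print $NF}'
```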

In every case though it seems this “exec format error” shows up when loading modules built from this out-of-tree content.

Hi.

I’ve been referring to this document (https://www.kernel.org/doc/Documentation/kbuild/modules.txt) to help me better understand what I’m actually doing.
From that I see that Module.symvers isn’t generated by the build target modules_prepare, which is why I was getting the warning shown previously (I’m assuming).

Going from the beginning now:

I’ve now synced the sources with my exact version just to be sure (which I should’ve done initially but…). So I checked my tegra release and synced the sources with r32.6.1, which obviously now exactly matches our version (even though you mention it shouldn’t be a problem).

Within the sources I’ve done, in order:

  • make mrproper
  • make tegra_defconfig
  • Changed CONFIG_LOCALVERSION="-tegra" in the generated .config
  • make modules

The build target make modules took much longer, but it now generated the Module.symvers file, which indeed removed the previous warning when compiling both drivers. Despite this warning now being gone, I still get the Exec format error when using modprobe on the drivers. I haven’t been able to find or figure out much more that I can do or try.

For each driver, I found out about insmod, which does indeed insert the modules (though from my understanding, it doesn’t manage any dependencies or do any of the extra work that modprobe does, so I’m not sure how good of a solution that would be). I’m not sure if this command would/should yield an Exec format error of some sort if it didn’t recognize the drivers, but the fact that it doesn’t show any errors and I can see the driver with lsmod only further confuses me. Why does insmod work without errors when modprobe doesn’t? It should (?) still show an error if you try to insert a module compiled for a different architecture (how would it work otherwise), yet it does not, so why does modprobe?
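One difference I can at least check: insmod loads exactly the file I give it, while modprobe looks the module name up under /lib/modules/$(uname -r) and loads whatever file is registered there, plus its dependencies. So the two commands may not be loading the same binary. A sketch for comparing them (the helper name `same_module` is made up by me):

```shell
#!/bin/sh
# Report whether two module files are byte-for-byte identical.
# $1 = the file I built, $2 = the file modprobe would load
# "same_module" is a hypothetical helper name.
same_module() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# On the Jetson (module name and path are examples):
#   same_module ./option.ko "$(modinfo -n option)"
#   modprobe --show-depends option   # lists each insmod step modprobe would take
```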

This makes sense to me, why would it not work though, I’ve looked through the changes in the source code of option.c, and there’s nothing particularly weird about it. I’m guessing this is still simply a problem of me (somehow) building the sources incorrectly.

This is the output:

aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I didn’t give much thought about cross compiling it in the host PC, since I see no reason on why it shouldn’t work on the jetson itself. It’s more convenient to just develop on the jetson so I didn’t give much thought to that solution.

I am not sure, honestly. Can you refer me to where I can find the different versions of the cross compiler? I can’t find it.

I don’t know if you have any more ideas about this. Would it be much to ask if you can replicate any of this?

Thank you.