Error found in TX2 Boot: "nvdc: open: Permission denied" and ***NvRmMemInit failed***

Hi again. (BTW @jbalao is my coworker).

We were able to determine that the errors mentioned in the first post started happening on Jetsons where we installed the 5G driver mentioned previously by jbalao.

Now, since the errors happen when loading /etc/profile, we ran some tests.

In the script /etc/profile we commented out the loop that runs all the *.sh scripts inside profile.d.
Because this error didn’t show up every time, we can’t conclude anything with absolute certainty, but after commenting out all these scripts we booted up dozens of times and didn’t see the error anymore.
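A possible shortcut for this kind of bisection is to source each profile.d script on its own and see which one writes to stderr. This is only a minimal sketch: find_noisy_profile_script is a made-up helper name, it assumes the scripts are safe to source in a throwaway subshell, and an error that depends on boot timing may not reproduce this way.

```shell
# Source every *.sh in a directory, one at a time, each in its own subshell,
# and print the names of the scripts that write anything to stderr.
# find_noisy_profile_script is a hypothetical helper, not part of /etc/profile.
find_noisy_profile_script() {
    dir=${1:-/etc/profile.d}
    for script in "$dir"/*.sh; do
        [ -r "$script" ] || continue
        # Capture stderr only: stderr goes to the pipe, stdout is discarded.
        err=$( ( . "$script" ) 2>&1 >/dev/null )
        if [ -n "$err" ]; then
            printf '%s\n' "$script"
        fi
    done
}

# On the Jetson (not run here):
#   find_noisy_profile_script /etc/profile.d
```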

Now, assuming it’s one of the scripts inside profile.d that is causing the problem, we looked through them and tried commenting out only the ones that looked most likely to cause trouble.

After some tests, and by tests I mean commenting out scripts and booting up the Jetson a whole lot of times, we are almost certain that the script /etc/profile.d/jetson_env.sh is causing that problem.

What we now have is /etc/profile running all scripts normally, except the jetson_env.sh script (we renamed it so it no longer matches the *.sh pattern). Since then we have not seen any of the errors shown in the first post when booting up.

My question is then, what does that script do and how essential is it?

We understand that this is a temporary solution for a problem we need to look further into, since what causes this is the 5G driver and not the scripts in /etc/profile.d, but for now we want to know what that script in particular does, and any clues as to what might be happening.

Either way, it’s not exactly fixed, but we are not getting any of those errors anymore with jetson_env.sh disabled.

This isn’t in order, I’m just making notes as I go.

  • Does the command line prompt lock up after the lsmod or “ls /dev/tegra_dc_*”?
  • What is shown from the command:
    grep video /etc/group
  • Is it possible to get a “dmesg” log which occurs just after the failure is noticed (admittedly this won’t be possible if there is a communications failure and it is in the air)?
  • Even though this is on a drone, is there a possibility of accessing the Jetson during the failure even if it is just for testing without the drone being in the air?
  • The kernel module “/lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko” is being moved aside (“mv”) and replaced by a third party module. It is possible this is compiled against a different kernel release or is in some way not compatible with this hardware if “option.ko” is used for anything other than this one device. Does the manufacturer make this module available as source code, or is it purely binary?
  • Similar for module “qmi_wwan_simcom.ko”. Is this available as source code, or is it just binary?
  • Actually, I see this did compile against the kernel headers, and so this is in source form, and it was compiled for this kernel, so disregard anything about it maybe not being built against this kernel. However, the compile log says this is built for the wrong architecture. This is what the exec format error is about:
modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

It is as if this was cross compiled incorrectly on a host PC and not on the Jetson. Were the install steps performed on a PC in any part at all?

Note that if something in “/etc/profile” needs the qmi_wwan_simcom module loaded, then this would be a mandatory failure. I think knowing why this has exec format error is key to solving it.

Hi.
To answer your questions I reverted the previously mentioned temporary solution, so now we have the error popping up again. Now to answer your questions:

No. If by ‘lock’ you mean it gets stuck and unusable, it does not happen. We get the expected output. (I’m not sure if the output itself is relevant so I’ll omit it for brevity).

video:x:44:heifu,gdm,lightdm

While we are working on a drone, most, if not all, tests are done on the bench, with just our motherboard connected to a power supply and all relevant modules. You can assume the drone is not flying and we’re always able to access the Jetson (even if it’s in the air, we have ways to do so). To answer the question, the “failure” is that popup during boot. The exact sequence of events is: Jetson powerup → company logo → popup with error found when loading /etc/profile ... → NVIDIA logo → desktop.
So I can’t really give you the dmesg log just after the failure, but I’ll attach the full dmesg log anyway, in case it helps.
dmesg.txt (63.8 KB)

Something we didn’t mention that might be important: these drivers are for the Jetson Nano, while we’re using a TX2 NX. The source code is available here: wget https://www.waveshare.net/w/upload/0/07/Sim8200_for_jetsonnano.7z, which you get from this url: https://www.waveshare.com/wiki/SIM8200EA-M2_5G_HAT# , down in Use with Jetson Nano.

We are not sure what difference it makes to use Jetson Nano drivers on a TX2 NX, but as you said, we also noticed that it was compiled for this kernel, which is why we’re so confused about the Exec format error mentioned.

Can you explain this a little bit better for us? As in, which architecture was expected (if you can). Does this have to do with the fact it’s a TX2 NX instead of a Nano?

Also, just as a note, because we have the source code of the driver, we’re not afraid of looking through the code and trying to make the necessary changes to fix this problem, if it is possible at all. If you look through the driver source code, can you spot anything that might explain the exec format errors?

We will focus on trying to understand this, if you think it’s the right direction.

I guess this is already a fairly long post, but one more question. Since we’re using drivers for the Nano, what exactly differs from the TX2 NX in terms of driver implementation? How far are we from changing the source implementation to, I guess, “fit” our environment?

I realize it’s a long post so, thank you for your patience and help.

Can you attach a copy of “/etc/profile.d/jetson_env.sh” here?

From what you say it seems this is an ordinary boot popup and there has not yet been any attempt of a user login. Can you verify this is correct? Is there any GUI issue when a user actually logs in? If this shows up from a user login (and it might not, I only know so far the popup hits during boot), is that user (the one logging in) listed in the output of “grep video '/etc/group'” (meaning user “heifu”)?

Incidentally, I think the permissions of the driver files in “/dev/tegra_dc_*” are correct, and the issue is about either the account accessing them, or the timing of access being too early (this is separate from the exec format error issue). Typically there are some GPU operations which require either using “sudo”, or alternatively, adding the user to group “video” (which would make the user show up in “/etc/group” under “video”; normal boot scripts, prior to a user login, are typically run as root).
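The group check can also be done mechanically. A minimal sketch, where group_line_has is a hypothetical helper that only parses one line in /etc/group format (it looks at the listed members, not at a user's primary group):

```shell
# Return success if user $2 appears in the member list of an /etc/group
# style line $1 (name:password:gid:member1,member2,...).
# group_line_has is a hypothetical helper name.
group_line_has() {
    members=${1##*:}              # everything after the last colon
    case ",$members," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example with the line posted earlier in this thread:
group_line_has 'video:x:44:heifu,gdm,lightdm' heifu && echo "heifu is in video"

# On a live system the line could come from grep (not run here):
#   group_line_has "$(grep '^video:' /etc/group)" heifu
```

On the running system, “id -nG heifu” gives the same answer and also covers the primary group.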

Note that the “exec format error” is a separate issue, though perhaps indirectly related. The dmesg log does show what is probably the particular moment when the popup occurs:

[    0.903753] tegradc 15200000.nvdisplay: hdmi: invalid prod list prod_list_hdmi_board
[    0.903756] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed

HDMI is “plug-n-play”, which means the system is able to probe the monitor and ask the monitor for its specs. This is done using the i2c protocol on the wire named “DDC”, and the result is known as the EDID data. The Jetson itself provides the power to the monitor’s i2c circuitry, which means the monitor can be queried even if it is turned off. The above mentioned error says there is something about the HDMI’s response which is not valid. Can you say more about whether this monitor is something “standard”, or if there is anything unusual about it? If you first go into a root shell via “sudo -s”, then what do you see from:

find /sys -name '*edid*'
egrep -i '*' `find /sys -iname 'edid'`

Can you attach a copy of “/etc/profile.d/jetson_env.sh” here?

Regarding the exec format error, this involves whatever device the qmi_wwan_simcom.ko module is for. Presumably this is what the hardware behind the “install.sh” script needs, i.e. the 5G module. This module seemingly has no HDMI involvement, and it is not possible for “qmi_wwan_simcom.ko” to load or function if it is the wrong architecture, but it might have some indirect influence. I say this because in kernel space something wrong with one driver can possibly interfere with another driver even when they are not related. Your 5G device might even have more than one driver, but the “qmi_wwan_simcom.ko” driver has no possibility of working if it is the wrong architecture; if there is some other kernel space driver which needs “qmi_wwan_simcom.ko” to function, then possibly the attempt to use a non-existent driver could have some odd effect.

First though we should see about EDID information I mentioned in the previous paragraph. If we know the monitor is valid, then such a popup is likely unrelated and we can move on to the exec format error. Btw, can you try a different monitor and see if this error goes away in the following command?
dmesg | egrep -i '(invalid prod list|tegra_hdmi_tmds_range_read)'

And yes, mixing a Nano driver with a TX2 NX might cause problems, but they are both the same architecture (arm64/aarch64), and so I would not expect an exec format error (the module might fail to load, but this would not be the error unless the Nano driver is really a 32-bit compatibility mode ARMv7-a driver and was never recompiled for ARMv8-a).

32-bit is ARMv7-a, while 64-bit is ARMv8-a. The former is typically known as armhf, while the latter is typically known as arm64 or aarch64. A desktop PC might be known as amd64 or x86_64. The binary code for one is not compatible on the other and they will refuse to run. If you compile for a PC natively on the PC, then you get what you could call amd64/x86_64. If you compile on a TX2, natively, then you would get arm64/aarch64. If you compile natively on an older 32-bit Jetson TK1, then you would get armhf. Trying to run a different type of executable for a different architecture on the TX2 causes an exec format error.

Often people will go look for drivers on the internet for some device. Many downloads assume a desktop PC and provide amd64/x86_64. If you put that on a Jetson and try to load it, then it says “exec format error”. Alternatively, if you compiled directly on the TX2, but the architecture were specified instead of allowing default native tools, then you would also get exec format error. If you compiled on a desktop PC, but used cross tools designed to output arm64, then it should work.
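One way to see which architecture a given .ko (or any ELF file) was actually built for is to read the e_machine field of its ELF header. A minimal sketch, assuming a little-endian ELF file; elf_machine is a hypothetical helper name, and the tools “file” and “readelf -h” report the same information in more detail:

```shell
# Print the architecture recorded in an ELF file's header.
# Reads the two-byte e_machine field at offset 18 (little-endian ELF assumed).
# elf_machine is a hypothetical helper name.
elf_machine() {
    # od prints the two bytes as unsigned decimals, e.g. " 183   0"
    set -- $(od -An -tu1 -j18 -N2 "$1" 2>/dev/null)
    [ $# -eq 2 ] || { echo unknown; return 1; }
    case $(( $1 + $2 * 256 )) in
        183) echo aarch64 ;;   # EM_AARCH64: 64-bit ARMv8-a (arm64)
        40)  echo arm ;;       # EM_ARM: 32-bit ARM (armhf)
        62)  echo x86_64 ;;    # EM_X86_64: desktop PC (amd64)
        *)   echo unknown ;;
    esac
}

# On the Jetson (not run here):
#   elf_machine /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko
```

If this prints aarch64 for a module that still fails to load, the file itself is at least tagged as arm64 and the cause has to be looked for elsewhere.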

Incidentally, because NVIDIA documents always give instructions for building kernel modules as a cross compile, whereby the cross tools run on amd64/x86_64, but produce modules which are for arm64, people tend to mix those up with the correct instructions when compiling natively on the Jetson. If you cross compile from a PC you would specify “ARCH=arm64”, but if you natively compile on a TX2 and say “ARCH=arm64”, then you will get a surprising result that running on the Jetson claims it is the wrong architecture. The Jetson is arm64, but the native tools do the right thing when you completely leave out “ARCH=arm64” (I don’t know why that is).

Something regarding the install.sh script is using invalid compile options. Both module “option.ko” and “qmi_wwan_simcom.ko” were built incorrectly (not because they failed to build, but instead because they were built for the wrong architecture):

modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

Hi.

jetson_env.sh (1.3 KB)

This is a little hard to verify. For our purposes, we want to boot as fast as possible and without any user interaction, so we disabled login prompt (there’s no GUI for user login, goes straight to desktop). I can only assure you (as you already asserted) that it happens during boot. The user logs in automatically, and in the output of grep video '/etc/group' it’s indeed the user “heifu”.

I think I understand your second paragraph. Can it still be a timing issue if the user “heifu” already belongs to the video group? Which I’ve confirmed it does belong there.

Not sure what to say, it’s a pretty standard monitor I guess. Sorry I can’t be much help there. The commands and their outputs are below:

/sys/kernel/debug/tegradc.1/edid
/sys/kernel/debug/tegradc.0/edid
/sys/bus/i2c/drivers/tegra_edid
/sys/module/drm/parameters/edid_fixup
/sys/kernel/debug/tegradc.1/edid:No EDID
/sys/kernel/debug/tegradc.0/edid: 00 ff ff ff ff ff ff 00 1e 6d 55 5b 01 01 01 01
/sys/kernel/debug/tegradc.0/edid: 01 1a 01 03 80 30 1b 78 ea 31 35 a5 55 4e a1 26
/sys/kernel/debug/tegradc.0/edid: 0c 50 54 a5 4b 00 71 4f 81 80 95 00 b3 00 a9 c0
/sys/kernel/debug/tegradc.0/edid: 81 00 81 c0 90 40 02 3a 80 18 71 38 2d 40 58 2c
/sys/kernel/debug/tegradc.0/edid: 45 00 e0 0e 11 00 00 1e 00 00 00 fd 00 38 4b 1e
/sys/kernel/debug/tegradc.0/edid: 55 12 00 0a 20 20 20 20 20 20 00 00 00 fc 00 4c
/sys/kernel/debug/tegradc.0/edid: 47 20 46 55 4c 4c 20 48 44 0a 20 20 00 00 00 ff
/sys/kernel/debug/tegradc.0/edid: 00 0a 20 20 20 20 20 20 20 20 20 20 20 20 01 61
/sys/kernel/debug/tegradc.0/edid: 02 03 1b f1 48 90 04 03 01 12 1f 10 13 23 09 07
/sys/kernel/debug/tegradc.0/edid: 07 83 01 00 00 65 03 0c 00 10 00 02 3a 80 18 71
/sys/kernel/debug/tegradc.0/edid: 38 2d 40 58 2c 45 00 e0 0e 11 00 00 1e 2a 44 80
/sys/kernel/debug/tegradc.0/edid: a0 70 38 27 40 30 20 35 00 e0 0e 11 00 00 1e 01
/sys/kernel/debug/tegradc.0/edid: 1d 00 72 51 d0 1e 20 6e 28 55 00 e0 0e 11 00 00
/sys/kernel/debug/tegradc.0/edid: 1e 8c 0a d0 8a 20 e0 2d 10 10 3e 96 00 e0 0e 11
/sys/kernel/debug/tegradc.0/edid: 00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00
/sys/kernel/debug/tegradc.0/edid: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4b

I’m not sure how to interpret any of this, I’ll leave that to you.

Attached at the beginning of the post.

I tested on different monitors, though they are all the same model and same specs (they’re the only company monitors available), and every time we had the popup during boot, that command yielded:

dmesg | egrep -i '(invalid prof list | tegra_hdmi_tmds_range_read)'
[    0.860814] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed
[    0.964339] tegradc 15210000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed
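When comparing outputs like this, note that the alternation is matched literally, including spelling and any spaces around the “|”. A pattern spelled “invalid prof list” cannot match a log line containing “invalid prod list”, so that message would be filtered out even if it were present. A minimal sketch with the pattern as originally suggested; filter_hdmi_errors is a hypothetical helper name:

```shell
# Keep only the two HDMI messages discussed in this thread.
# Reads a kernel log on stdin. filter_hdmi_errors is a hypothetical helper name.
filter_hdmi_errors() {
    grep -E -i '(invalid prod list|tegra_hdmi_tmds_range_read)'
}

# On the Jetson (not run here):
#   dmesg | filter_hdmi_errors
```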

This is very helpful. So, the exec format error should be triggered by an executable built for a different architecture. I’m not sure if we installed the driver directly (rather than compiling it first and then installing), but I can try to compile the whole module directly on the TX2 and then install it, hopefully ensuring it’s built for the right architecture.

Once again, very helpful to know. That is very interesting.

So, adding to what I said before, I will look through the install script and see exactly what it is doing. I’m going to look for a cross compile flag, disable it and compile with native tools on the TX2 NX.

I guess my next move is to try to fix the Exec format error, as you’ve hinted before.

That was very insightful and a lot of good information. Thank you for your time. I answered some of the questions you left and will update once I have made any progress regarding the Exec format error.

Francisco

The script “jetson_env.sh” is from JTOP. I am going to guess that if you remove JTOP, then the popup will go away. I don’t know the exact reason it is giving a permission denied error, but it wouldn’t be unusual for something like JTOP to monitor a file in “/sys” or “/proc”, and some of those change depending on kernel version or configuration…possibly it just needs some sort of update to deal with such a change.

For reference, your EDID from the monitor was valid. You can explore what it reports by pasting into http://www.edidreader.com/:

00 ff ff ff ff ff ff 00 1e 6d 55 5b 01 01 01 01
01 1a 01 03 80 30 1b 78 ea 31 35 a5 55 4e a1 26
0c 50 54 a5 4b 00 71 4f 81 80 95 00 b3 00 a9 c0
81 00 81 c0 90 40 02 3a 80 18 71 38 2d 40 58 2c
45 00 e0 0e 11 00 00 1e 00 00 00 fd 00 38 4b 1e
55 12 00 0a 20 20 20 20 20 20 00 00 00 fc 00 4c
47 20 46 55 4c 4c 20 48 44 0a 20 20 00 00 00 ff
00 0a 20 20 20 20 20 20 20 20 20 20 20 20 01 61
02 03 1b f1 48 90 04 03 01 12 1f 10 13 23 09 07
07 83 01 00 00 65 03 0c 00 10 00 02 3a 80 18 71
38 2d 40 58 2c 45 00 e0 0e 11 00 00 1e 2a 44 80
a0 70 38 27 40 30 20 35 00 e0 0e 11 00 00 1e 01
1d 00 72 51 d0 1e 20 6e 28 55 00 e0 0e 11 00 00
1e 8c 0a d0 8a 20 e0 2d 10 10 3e 96 00 e0 0e 11
00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4b
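Two of the basic EDID facts can be checked by hand without a web tool: every 128-byte block must sum to 0 modulo 256, and bytes 8 and 9 pack a three-letter manufacturer ID in 5-bit fields. A minimal sketch; edid_block_ok and edid_vendor are hypothetical helper names, and the file edid.txt in the usage comment is assumed to hold the hex dump above:

```shell
# Check one 128-byte EDID block given as hex bytes on stdin:
# a valid block has exactly 128 bytes whose sum is 0 modulo 256.
# edid_block_ok is a hypothetical helper name.
edid_block_ok() {
    sum=0
    count=0
    for byte in $(cat); do
        sum=$(( sum + 0x$byte ))
        count=$(( count + 1 ))
    done
    [ "$count" -eq 128 ] && [ $(( sum % 256 )) -eq 0 ]
}

# Decode the manufacturer ID from EDID bytes 8 and 9 (two hex digits each).
# Each letter is a 5-bit code where 1 = A ... 26 = Z.
# edid_vendor is a hypothetical helper name.
edid_vendor() {
    word=$(( (0x$1 << 8) | 0x$2 ))
    for pos in 10 5 0; do
        code=$(( (word >> pos) & 0x1f ))
        printf "\\$(printf '%03o' "$(( code + 64 ))")"
    done
    printf '\n'
}

edid_vendor 1e 6d   # bytes 8 and 9 of the dump above; prints GSM

# With the dump saved as edid.txt (an assumed file name), block 0 is rows 1-8:
#   head -n 8 edid.txt | edid_block_ok && echo "block 0 checksum OK"
```

The first eight rows of the dump pass the checksum test, and the ID GSM is consistent with the “LG FULL HD” name string (4c 47 20 46 55 4c 4c 20 48 44) in the same block.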

Because EDID is valid I don’t know what the earlier issue was from this error:

[    0.903753] tegradc 15200000.nvdisplay: hdmi: invalid prod list prod_list_hdmi_board
[    0.903756] tegradc 15200000.nvdisplay: hdmi: tegra_hdmi_tmds_range_read(bd) failed

It is possible that this is simply a monitor that has something unexpected. Not sure if it is an issue or not. If the monitor works, then I guess just ignore this. If not, then that might be a new thread. On the other hand, we can verify that it isn’t the monitor causing the popup.

After removing JTOP see if the popup goes away. Perhaps there is an updated version specifically for the TX2 if you need JTOP.

Hi.

I still haven’t reached any conclusion on the Exec format error. Among other work related things, I’ve been trying to understand why this happens and how I can compile it for the right architecture, so I don’t really have an update on that.

I’m not sure if the problem is related to ‘cross compiling’ vs ‘using native tools’ because this script looks like it was made to run directly on a Jetson, as it both compiles and installs the drivers (because it also installs the drivers, I’m assuming it is intended to be executed directly on the Jetson). If this was the problem somehow, I feel like this would’ve been a major oversight that would’ve already been fixed. Though, again, this is a TX2 NX and these are meant for the Nano, so I’m not sure how the behavior would differ there.

Although, an interesting find is that, after completely removing JTOP from our Jetson, neither of the errors mentioned in the first post is happening and, surprisingly enough, where there’s coverage, we can connect to the 5G network. And the drivers seem to be installed (?) since I can see them in their respective dirs that are present in the install.sh script provided.

They do make it clear in the documentation that the system kernel should be 4.9.140-tegra, while ours is 4.9.253. Since we’re on the same major version, I didn’t think much of it, but it could be a problem as well.

I still feel like there’s a problem because of the Exec format error messages, despite the 5G module actually working, and uninstalling JTOP might’ve just suppressed some superficial error. Not sure if I’m correct or not, but I’ll keep updating as I go.


I am at a point where I don’t understand why the drivers option and qmi_wwan_simcom give Exec format error.
When I check the info for each driver, it looks like (to me) it’s the same version, so for example:
Output of uname -a:
Linux heifu-tx2-nx 4.9.253-tegra #1 SMP PREEMPT Fri Nov 5 15:14:33 WET 2021 aarch64 aarch64 aarch64 GNU/Linux

Output of modinfo qmi_wwan_simcom

version:        Simcom_Linux_QMI_WWAN_Driver_V1.0
license:        GPL
description:    Qualcomm MSM Interface (QMI) WWAN driver
author:         Bjørn Mork <bjorn@mork.no>
srcversion:     29480A58515EB2DB048A6C6
alias:          usb:v1E0Ep9001d*dc*dsc*dp*ic*isc*ip*in05*
depends:        cdc-wdm
vermagic:       4.9.253-tegra SMP preempt mod_unload modversions aarch64

Output of modinfo option

filename:       /lib/modules/4.9.253-tegra/kernel/drivers/usb/serial/option.ko
license:        GPL
description:    USB Driver for GSM modems
author:         Matthias Urlichs <smurf@smurf.noris.de>
alias: ...
( ... )
alias: ....
depends:        usb_wwan
vermagic:       4.9.253-tegra SMP preempt mod_unload modversions aarch64

So, unless I’m interpreting this wrong, it looks like it’s being compiled for the same architecture? Not sure if I can infer that from this though, I might need some more in depth information on that.
But pretty much I’m out of ideas on why it gives that error for each driver.
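The comparison done by eye above can be scripted. A minimal sketch that only compares the release part of the vermagic string with a kernel release; vermagic_matches is a hypothetical helper name. The kernel's own check also looks at the remaining flags, and a vermagic mismatch is one of the known reasons for modprobe to report Exec format error, in which case the kernel log normally says so right after the failed insert.

```shell
# Compare the kernel release at the start of a module's vermagic string
# with a given release (normally the output of `uname -r`).
# vermagic_matches is a hypothetical helper name.
vermagic_matches() {
    vermagic=$1
    release=$2
    [ "${vermagic%% *}" = "$release" ]
}

# With the values posted above:
vermagic_matches '4.9.253-tegra SMP preempt mod_unload modversions aarch64' \
    '4.9.253-tegra' && echo "release matches"

# On the Jetson (not run here):
#   vermagic_matches "$(modinfo -F vermagic option)" "$(uname -r)"
```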

If this is native compile, then specifying ARCH=arm64 would seem to be harmless since the ARCH is already arm64, but this is a subtle failure. I don’t know why, but if you do specify ARCH, despite this being the same as the native system, this tends to cause the native system to call this a foreign architecture (it probably shouldn’t, but this is my experience). Make sure on native compiles that you never use “ARCH=arm64” since this can have a different result than not specifying “ARCH=arm64” on an arm64 system. Having an exec format error is an absolute guarantee the file cannot be used without an emulator or other special means of execution. Your system thinks the file is cross compiled for a different architecture than what is native. Perhaps it is related to thinking different loading/linking tools are required, but that is just guessing.
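To follow this advice, it may help to confirm that nothing in the shell environment is setting these variables before running make. A minimal sketch; native_build_env_ok is a hypothetical helper name, and it only inspects the environment, not values passed on the make command line or assigned inside a Makefile:

```shell
# Succeed only if neither ARCH nor CROSS_COMPILE is set in the environment,
# which is what the advice above asks for on a native build.
# native_build_env_ok is a hypothetical helper name.
native_build_env_ok() {
    status=0
    if [ -n "${ARCH+set}" ]; then
        echo "ARCH is set to '$ARCH'" >&2
        status=1
    fi
    if [ -n "${CROSS_COMPILE+set}" ]; then
        echo "CROSS_COMPILE is set to '$CROSS_COMPILE'" >&2
        status=1
    fi
    return "$status"
}

# Typical use before a native module build (not run here):
#   native_build_env_ok && make
```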

Much will depend on the actual build steps for the offending “exec format error” in combination with the environment it was built in.

Using the wrong kernel can break things. Not nearly enough is known to say if substituting 4.9.253 instead of 4.9.140 will matter. If software is compiled to work with the other kernel, then it won’t be a problem. Or if the changes from 4.9.140 to 4.9.253 left an intact API/ABI, then this too won’t matter.

I have to emphasize that exec format error is an absolute guarantee of failure. The CPU thinks this cannot run on it. It won’t even try. This is entirely a build issue; either native build specified it such that it thinks it is foreign, or else it was actually a foreign build without cross tools to make it into the native architecture. Should this be a build issue on a native environment, then it is likely a simple fix related to removing something which was originally for foreign cross build. “uname -a” tells you about the local system, and whatever that file is, the system does not think it is compatible. “option” apparently depends on “usb_wwan”, but this cannot be loaded, thus “option” fails.

Just to emphasize, it won’t matter if this is the right architecture if some detail in build treats this as if it is foreign and causes native to believe this is foreign.

Hi.

I won’t address everything you said, but know that I think I understood most, if not all, of what you are saying and I’m keeping that in mind as I go. There was a lot of misunderstanding on my part about how these things work, so thank you very much for the in depth clarifications. Sorry if we’re being redundant at times, but it has been really valuable to me.

Now, you obviously highlighted this:

I understand what you mean, as in, if you compile in the native system, but still cross compile to the same system, it will recognize it as being another architecture, despite being compiled “for itself” essentially.

I’ve been going through the Makefiles for each driver that we mentioned previously, looking for a cross compilation flag. One thing I didn’t know is that modules are compiled with a makefile that exists within /lib/modules/$(uname -r)/build. And it’s interesting because I always found it weird that the makefile provided for the individual drivers didn’t call a compiler like gcc, but rather another make, which didn’t make much sense until I noticed that it changes directories first and then makes the module inside /lib/modules/$(uname -r)/build.

So let’s say, this makefile is provided by waveshare to compile the option driver (I omitted the clean target for brevity):

obj-m:=option.o
optionmodule-objs:=module
KDIR:=/lib/modules/$(shell uname -r)/build
MAKE:=make
default:
	$(MAKE) -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) modules

So this will actually build with the makefile in /lib/modules/$(uname -r)/build/Makefile. That has a lot of makefile code that I can skim through but can’t say I fully understand (as it is very long as well), but I do see some references of cross compilation support. Because there’s a lot of code, I can’t really understand if any cross compilation flag is being set within the makefile, even without being clearly specified in the make command.

One thing that I’m assuming throughout this conversation is that the “exec format error” comes specifically because it was built correctly but the native system recognizes it as a foreign architecture, whether or not it was compiled with native tools, if ARCH=arm64 was given, for example. And because of this, quoting you: “The CPU thinks this cannot run on it. It won’t even try.”

Is this always the case for “exec format error”, or can there be anything else you might think of?

This question is because I’m currently just trying to find a way, in here /lib/modules/$(uname -r)/build/Makefile where it is somehow cross compiling the drivers, despite me not finding any apparent flags like ARCH=arm64. I do see some references to an $ARCH variable that is assigned through this command:

SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
				  -e s/sun4u/sparc64/ \
				  -e s/arm.*/arm/ -e s/sa110/arm/ \
				  -e s/s390x/s390/ -e s/parisc64/parisc/ \
				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
				  -e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )

Which in my case would output arm64. So if that variable makes its way into the actual compilation of the module, it would always be “cross compiling” the driver (?). Not really sure if what I’m saying is correct, but it’s what I’m interpreting.
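The substitution can be tried outside of make to see what it yields for a given machine name. A minimal sketch that reuses the sed expressions quoted above; subarch_of is a hypothetical helper name:

```shell
# Apply the same substitutions as the SUBARCH line quoted above to a
# machine name such as the output of `uname -m`.
# subarch_of is a hypothetical helper name.
subarch_of() {
    printf '%s\n' "$1" | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
        -e s/sun4u/sparc64/ \
        -e 's/arm.*/arm/' -e s/sa110/arm/ \
        -e s/s390x/s390/ -e s/parisc64/parisc/ \
        -e 's/ppc.*/powerpc/' -e 's/mips.*/mips/' \
        -e 's/sh[234].*/sh/' -e 's/aarch64.*/arm64/'
}

subarch_of aarch64   # prints arm64
subarch_of x86_64    # prints x86
subarch_of armv7l    # prints arm
```

In mainline kernels of this generation the top-level Makefile then uses SUBARCH as the default for ARCH when none is given on the command line or in the environment (“ARCH ?= $(SUBARCH)”). Whether the NVIDIA tree does the same has not been checked here.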

Though, if this observation of yours is correct (that cross compiling for the same architecture results in a “foreign architecture”), which I’m assuming it is, I’m not sure why that makefile would cross compile by default, without any flags specified, so I’m not sure if my, let’s call it theory, that it always cross compiles is correct.
Not sure if what I said about “cross compiling by default” made sense, but if I didn’t know that cross compiling with the same native tools would result in a foreign architecture, I personally would think it a reasonable thing to do, to ensure a specific architecture. So just take this paragraph as my observations on the matter and what is going through my head, nothing that I’ve specifically observed, just kind of thinking out loud.


So I guess, to finish with some actual questions:

  • Is exec format error specific to a foreign architecture, or can it be triggered by a different situation?
  • Could the actual source implementation of the driver, somehow, specify a different architecture such that, once compiled natively, it’s still meant for a specific system, not necessarily the one it was compiled on?
  • Can I, somehow, ensure that a makefile doesn’t cross compile? (I’ve tried to unset the ARCH variable within the makefile, but it would probably take a much deeper solution than doing just that).

I’d greatly appreciate any information, even if only indirectly related but still useful for this discussion.

Thank you very much for your help so far.

This sounds like what would happen if you were building out-of-tree modules. Normally, if you have the full source tree, then that Makefile would not be consulted. Do you have any third party code copied in, without using the downloaded full kernel source from the NVIDIA website? It sounds like this is what the waveshare is. If so, then you might need to set up the symbolic links in “/lib/modules/$(uname -r)” to point at your manual content which is from the NVIDIA download and configured to match the running kernel (which would cause it to use the right Makefile and NVIDIA content). What you wouldn’t want to do is edit this manually since it is (A) easier to just set up NVIDIA source, and (B) there are so many edits.

This is correct:

Yes, this is always the case. “exec format error” implies the CPU won’t use the code of another completely different processor. The instruction sets are not the same. It is just binary noise so far as that CPU is concerned.

Basically, check that there are symbolic links in “/lib/modules/$(uname -r)/”. See where they point. This is what packages for building modules against a system normally look at when they don’t have full kernel source…this is out-of-tree build infrastructure. Now imagine that we unpack the full kernel source (if no .deb package is available which matches the existing system, and NVIDIA is customized, but I’m not sure if such a header-only package is available for this…in the past it was not, and apparently what is there now is not customized for this NVIDIA release). Now assume that source exists at “/usr/src/sources/kernel/kernel-4.9”. Imagine that within this source you’ve run the correct “sudo make tegra_defconfig” (I use sudo because this should be owned only by root, readable by others). Then imagine you have set the generated .config variable CONFIG_LOCALVERSION to “-tegra”. This source could be used to build against the running kernel.

Now take this a step further, and change the symbolic links in “/lib/modules/$(uname -r)” to point to the equivalents in “/usr/src/sources/kernel/kernel-4.9”. Your outside module should build correctly against this.
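To see where those links currently point before changing anything, something like the following can be used. A minimal sketch; show_module_links is a hypothetical helper name and it relies on the readlink utility being installed, as it is on typical Linux systems:

```shell
# List the symbolic links directly inside a modules directory and where
# they point. show_module_links is a hypothetical helper name.
show_module_links() {
    dir=${1:-/lib/modules/$(uname -r)}
    for entry in "$dir"/*; do
        if [ -L "$entry" ]; then
            printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
        fi
    done
}

# On the Jetson (not run here):
#   show_module_links
```

Re-pointing a link would then be along the lines of “sudo ln -sfn /usr/src/sources/kernel/kernel-4.9 /lib/modules/$(uname -r)/build”, assuming the source really was unpacked at that path.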

I don’t know what is used to check architecture before load. I am only guessing that it has some “magic” written somewhere, and specifying ARCH from the build. Perhaps, since it was not expected to use ARCH in combination with cross tools, that this “magic” was never written. Don’t know, but I have had this cause this failure in the past, so I know something is not right. Maybe native tools don’t know to write this information when used as a cross compiler since most of the time it would be redundant.

My suggestion is to first make sure you have enough disk space. Then install source at “/usr/src/kernel/kernel-4.9” (or similar). Configure this as if you are going to build everything there, including CONFIG_LOCALVERSION. Change those symbolic links to point there. Try building your out-of-tree content again against this.

Hi.

I’ve been trying to install them by compiling using two different methods, which I will explain further. Both compile the modules successfully but still yield exec format error when attempting to add them.

First is what you’ve suggested:

I was able to find a thread where you gave a more detailed explanation on how to do this (errror executing modules prepare)

So, I’ve installed the sources in /usr/src via ./source_sync.sh -k tegra-l4t-r32.1. Now I have the full kernel source at /usr/src/sources/kernel/kernel-4.9.
In /lib/modules/$(uname -r) I changed where build points to, to point to the full kernel source. Looks something like this:

heifu@heifu-tx2-nx:/lib/modules/4.9.253-tegra 
$ ls -l
total 1292
lrwxrwxrwx 1 root root     34 Oct  3 11:01 build -> /usr/src/sources/kernel/kernel-4.9
(...)  

Now, inside build (which is pointing to the full kernel sources), I did:

sudo make mrproper
sudo make tegra_defconfig 

I’m not sure of the correct way of setting the LOCALVERSION, so I tried two different ways: one was simply export LOCALVERSION=-tegra, and the other was to manually edit the generated .config file and change CONFIG_LOCALVERSION to CONFIG_LOCALVERSION="-tegra".
Either way, after setting CONFIG_LOCALVERSION to -tegra, I did sudo make modules_prepare. No problems up to here.
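The manual .config edit can be made repeatable with a small function. A minimal sketch, assuming GNU sed (for -i); set_localversion is a hypothetical helper name. One thing to keep in mind with the export variant is that sudo usually resets the environment, so a variable exported in the user's shell may not reach a make that is run under sudo.

```shell
# Set CONFIG_LOCALVERSION in a kernel .config file, replacing an existing
# line or appending one. Requires GNU sed for the -i option.
# set_localversion is a hypothetical helper name.
set_localversion() {
    config=$1
    value=$2
    if grep -q '^CONFIG_LOCALVERSION=' "$config"; then
        sed -i "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"$value\"|" "$config"
    else
        printf 'CONFIG_LOCALVERSION="%s"\n' "$value" >> "$config"
    fi
}

# In the kernel source directory, as root (not run here):
#   set_localversion .config -tegra
```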
Now I run this very straightforward install script (also provided by waveshare but with slight changes to just make and install both modules):

cd option
make
mv /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/option_bk.ko
cp option.ko /lib/modules/$(uname -r)/kernel/drivers/usb/serial/
cd ..

cd qmi_wwan_simcom
make
cp qmi_wwan_simcom.ko /lib/modules/$(uname -r)/kernel/drivers/net/usb
cd ..

depmod
modprobe option
modprobe qmi_wwan_simcom
modprobe -r qmi_wwan_simcom
modprobe qmi_wwan_simcom

Which outputs (hopefully it’s not too big of a text dump):

make -C /lib/modules/4.9.253-tegra/build -I ./usb_wwan SUBDIRS=/home/heifu/Sim8200_for_jetsonnano/option modules
make[1]: Entering directory '/usr/src/sources/kernel/kernel-4.9'

  WARNING: Symbol version dump ./Module.symvers
           is missing; modules will have no dependencies and modversions.

  CC [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/option/option.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/option/option.ko
make[1]: Leaving directory '/usr/src/sources/kernel/kernel-4.9'
rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions Module.* modules.order
make -C /lib/modules/4.9.253-tegra/build M=/home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom modules
make[1]: Entering directory '/usr/src/sources/kernel/kernel-4.9'

  WARNING: Symbol version dump ./Module.symvers
           is missing; modules will have no dependencies and modversions.

  CC [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.mod.o
  LD [M]  /home/heifu/Sim8200_for_jetsonnano/qmi_wwan_simcom/qmi_wwan_simcom.ko
make[1]: Leaving directory '/usr/src/sources/kernel/kernel-4.9'
modprobe: ERROR: could not insert 'option': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error
modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

There’s a warning for each module build, not sure how important or relevant it is for the process.
It is indeed building with the full sources, as far as I can tell, and also I am not (intentionally) cross compiling it anywhere. I’m really not sure at this point if I setup something wrong or if there’s another problem.
If, for example, I run the file command on one of the modules I get:

file option.ko 
option.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=dd94366b76bb65de18c801b075540f05d4f5d52d, with debug_info, not stripped

Which, unless I’m overlooking something, tells me it is compiled for the correct architecture, so for this not to work ARCH=arm64 would have had to be specified at some point.
I’m really not sure how to go on from here.

For the second method

Here I tried to actually cross compile the modules, but now on my actual work machine, which is Ubuntu 20.04, x86_64. My idea was, if it was somehow cross compiling inside the jetson with native tools, let’s try to actually cross compile it from a different architecture.
So I basically did the same: I set up the environment for cross compiling (which I’ve done before for compiling the actual kernel sources). I set up a directory with the full kernel sources and ran the same setup and the same commands, only now explicitly providing the env variables CROSS_COMPILE=/usr/bin/aarch64-linux-gnu- ARCH=arm64 LOCALVERSION=-tegra.
It compiles with some warnings (not sure how important they are, I’ll just attach the compile logs):
crosscompile_logs.txt (5.1 KB)
and I confirmed the module is indeed for aarch64

file option.ko 
option.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=94c395ed44a856871b01bf3d7e71adcdce1340e0, with debug_info, not stripped

So I moved these modules, cross compiled in my x86_64 machine, to the jetson, tried to install them but still I get the exec format error.
I’m not sure if this would work (though in my mind it should, since now that I’m cross compiling, I’m not doing it with native tools so, from what I understood from you, it should’ve been fine) but I still couldn’t do it.

I insisted more on the first method, since it’s what you’ve actually recommended but I can’t get it to work. Not sure if you can spot something I did wrong or might’ve missed but either way, I think I still need some help here.

Hopefully this makes sense, I’m not very knowledgeable here but I think I’ve been understanding your points so far, but I’m still not able to get this to work.

Thank you so much for your help so far.
Francisco.

This is not in a particular order, I’m just adding notes as I read your reply. Each reply note might be before reading all of your reply.

Beware that although source_sync.sh is probably ok for a TX2, there are some releases which might need to be downloaded directly from the web site for the specific L4T release. Basically though I think this should be ok for your case, along with the change of the symbolic link.

The mrproper and tegra_defconfig inside the full sources is correct. However, for modules to see this, you would also need to provide a “modules_prepare” step if no kernel Image is built first. Before you do that be sure to set CONFIG_LOCALVERSION. I happen to know this has no dependencies, and so it is ok to directly edit the produced “.config” file, or to use a config editor (e.g., menuconfig or nconfig). If you go to the “$TOP” (in your case “/usr/src/sources/kernel/kernel-4.9”), then you can edit the file produced during the “tegra_defconfig” step (perhaps the “/proc/config.gz”, after decompression and renaming to “.config” would be a better choice, but initially this should be no different than “tegra_defconfig”) to have this:
CONFIG_LOCALVERSION="-tegra"

One could also use the command line during the make, although I have not personally used this. I’m not positive, but I think this is the equivalent:
make LOCALVERSION=-tegra ...
(which is equivalent to what you suggested for “export LOCALVERSION=-tegra”)
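
For the hand-edit route, a minimal sketch of changing CONFIG_LOCALVERSION from a script might look like the following. The function name set_localversion is made up for illustration, and the demo runs against a throwaway file rather than a real kernel tree; it also assumes GNU sed for the in-place edit.

```shell
#!/bin/sh
# Set CONFIG_LOCALVERSION in a kernel .config file.
# Usage: set_localversion <path-to-.config> <suffix>
# (set_localversion is a hypothetical helper name, not part of any kernel tooling.)
set_localversion() {
    config_file=$1
    suffix=$2
    if grep -q '^CONFIG_LOCALVERSION=' "$config_file"; then
        # An assignment already exists: replace it in place (GNU sed -i).
        sed -i "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"$suffix\"|" "$config_file"
    else
        # No assignment yet: append one.
        printf 'CONFIG_LOCALVERSION="%s"\n' "$suffix" >> "$config_file"
    fi
}

# Demo on a temporary file that only imitates a .config.
demo_config=$(mktemp)
printf 'CONFIG_LOCALVERSION=""\nCONFIG_MODULES=y\n' > "$demo_config"
set_localversion "$demo_config" "-tegra"
grep '^CONFIG_LOCALVERSION=' "$demo_config"
```

On a real tree this would be called as set_localversion .config -tegra from the top of the kernel source, after tegra_defconfig and before modules_prepare.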

Once this directory is set up with complete configuration and modules_prepare, you would then want to always use the “O=/some/where” alternate intermediate build location (and you’d still configure at the alternate location, but this would not modify “$TOP”) and not build with sudo. Mainly the configuration which the symbolic link “build” points to is for externally compiled modules which reference the running system’s configuration. If you were to build a full kernel or all modules, then the “O=/some/where” would still have its own independent configuration.

Does this waveshare script have its own kernel module source? If this script is building a standard module, versus building an out-of-tree module which is not in the original kernel source, it changes what is needed. If this were simply building in-tree (something existing as part of the default kernel source), then it wouldn’t need any special script and I’d suspect the script is doing something to cause the false “exec format error” issue.

I’m thinking I saw that this waveshare source wants to replace the stock option.ko with its own version, in which case it would be valid to provide their own source and build it against the source which is mostly configured to match your running system. I say “mostly” because it might be necessary to do something odd to remove the existing “option.ko” module spec and instead build the external source using the same file name. This is where it gets confusing and I’m not sure what is going on in this step, but moving option.ko to a backup (option_bk.ko) says this is what is likely happening.

If you were to try to use a module which does not have a correct CONFIG_LOCALVERSION, then the error would have been different than exec format error. Probably whatever is wrong has nothing to do with CONFIG_LOCALVERSION.
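
One way to rule the version string in or out is to compare the release recorded in the module with the running kernel. The sketch below is only an illustration: vermagic_matches is a made-up helper name, and the sample vermagic string is invented to resemble what a 4.9 Tegra module might carry, not copied from this system.

```shell
#!/bin/sh
# Compare the kernel release embedded in a module's vermagic string
# with a kernel release string.
# Usage: vermagic_matches "<vermagic string>" "<kernel release>"
# (vermagic_matches is a hypothetical helper name.)
vermagic_matches() {
    # The first whitespace-separated field of vermagic is the kernel release.
    module_release=${1%% *}
    [ "$module_release" = "$2" ]
}

# Illustrative values only; see the note below for the real commands.
sample_vermagic="4.9.253-tegra SMP preempt mod_unload modversions aarch64"
if vermagic_matches "$sample_vermagic" "4.9.253-tegra"; then
    echo "release strings match"
else
    echo "release strings differ"
fi
```

On the Jetson itself the two arguments would come from modinfo -F vermagic option.ko and uname -r respectively.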

The warning about missing “Module.symvers” might not matter, but the combination of “modules_prepare” and building the actual modules should generate this. You might try building actual modules with sudo in “$TOP” to get a Module.symvers if this is an issue. Then build your out-of-tree module again (but like I mentioned, I don’t know if this is needed…an error from this is likely different than “exec format error”).

The “file option.ko” says the format is correct, so something in the metadata of the build is at issue. Perhaps it is the missing Module.symvers file, and building modules in “$TOP” would help…not sure, but it is easy to find out. Just make sure the option.ko you are inserting is not the one you compiled earlier with the exec format error.

As long as the compiler version is correct it shouldn’t matter if you cross compile from an Ubuntu 18.04 host PC versus 20.04. It is good to test this. I see another warning about missing Module.symvers, and so perhaps this is important.

On the host PC, what do you see from:
/usr/bin/aarch64-linux-gnu-gcc --version
(and did you install the cross compiler from the NVIDIA web site which goes with your L4T release?)

In every case though it seems this “exec format error” is during compile of this out-of-tree content.

Hi.

I’ve been referring to this document (https://www.kernel.org/doc/Documentation/kbuild/modules.txt) to help me better understand what I’m actually doing.
From that I see that Module.symvers doesn’t generate from the build target modules_prepare, hence why I was having the warning shown previously (I’m assuming).

Going from the beginning now:

I’ve now synced the sources with my exact version just to be sure (which I should’ve done initially but…). So I checked my tegra release and synced the sources with r32.6.1, which obviously now exactly matches our version (even though you mention it shouldn’t be a problem).

Within the sources I’ve done, in order:

  • make mrproper
  • make tegra_defconfig
  • Changed CONFIG_LOCALVERSION="-tegra" in the generated .config
  • make modules

The build target make modules took much longer, but it now generated the Module.symvers file, which indeed removed the previous warning when compiling both drivers. Despite this warning now being gone, I still get the Exec format error when using modprobe on the drivers. I haven’t been able to find/figure out much more that I can do or try.

For each driver, I found out about insmod, which does indeed insert the modules (though from my understanding, it doesn’t manage any dependencies or any extra work that modprobe does, so I’m not sure how good of a solution that would be). I’m not sure if this command would/should yield Exec format error of some sort if it didn’t recognize the drivers, but the fact that it doesn’t show any errors and I can inspect the driver with lsmod only further confuses me in this situation. Why does insmod work without errors when modprobe doesn’t? As in, it should (?) still show an error if you try to insert a module compiled against a different architecture (how would it work otherwise), yet it does not, so why does modprobe?

This makes sense to me, why would it not work though, I’ve looked through the changes in the source code of option.c, and there’s nothing particularly weird about it. I’m guessing this is still simply a problem of me (somehow) building the sources incorrectly.

This is the output:

aarch64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I didn’t give much thought about cross compiling it in the host PC, since I see no reason on why it shouldn’t work on the jetson itself. It’s more convenient to just develop on the jetson so I didn’t give much thought to that solution.

I am not sure honestly. Can you refer me to where I can find the different versions of the cross compiler? I can’t find it.

I don’t know if you have any more ideas about this. Would it be much to ask if you can replicate any of this?

Thank you.

In the above, did you “make modules_prepare” before “make modules”? This, or building the target “make Image” would be needed prior to “make modules”. Without this the module configuration would still be invalid. The “make Image” is slow and not mandatory, but I like doing this once to see if something else will go wrong; it is a nice acid test for a number of issues which you might not see if only building modules.

The “exec format error” is a very serious problem and I wonder if the replacement “option.ko” is itself what triggers this. It is quite possible, if this is binary format, that this is what is failing. What do you see from “file option.ko” if you run this against the replacement module? Most of what you are doing seems correct, and there should be no possibility (other than due to quirks not normally run into) which would cause “exec format error”.

I could be wrong, but it sounds like the cross compiler you are using (version 9.4.0) is the one which comes directly with Ubuntu, and is not the one NVIDIA provides (the newer one might not work). I think your L4T release is R32.6.1, and the URL for that content is:
https://developer.nvidia.com/embedded/linux-tegra-r3261

On that URL page, look for “GCC 7.3.1 for 64 bit BSP and Kernel”. This is where you get version 7.3.1, which is known to work. I wouldn’t think your other version would cause exec format error, but you would want to use 7.3.1 anyway (I am still wondering if the replacement “option.ko” is at fault). Once you have that installed verify it is correct via:
/where/ever/it/is/aarch64-linux-gnu-gcc --version
(and your CROSS_COMPILE variable would be set to “/where/ever/it/is/aarch64-linux-gnu-”)

Yes, you are correct, compile should work on a Linux PC or natively. In fact though, you shouldn’t have any “aarch64-linux-gnu-gcc” if you are on the Jetson itself; this is only used for cross compile from the PC. If performing native compile, then there is no “CROSS_COMPILE=/some/where...”.

I didn’t know this. I’ll make sure to keep that in mind from now on. Also, what exactly does make Image ensure? As in, I assume it builds the Image file, but when you say “but I like doing this once to see if something else will go wrong;”, what exactly could go wrong here that I would not see when building modules?
Just wondering if it’s really necessary because with the TX2 NX we have very tight disk space (and also it takes a long time, which I would prefer to avoid in these testing scenarios).


Since we don’t have any external storage, I’m currently limited to the 16 GB (much less in reality) of the TX2 NX, which won’t be enough to make modules and/or make Image. So I think I will try to work on my PC and cross compile what I need.

I also changed the cross compiler I was using on my PC to the one provided by NVIDIA (as you suggested) and I think now it’s setup correctly, as seen in:

./aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Linaro GCC 7.3-2018.05) 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I’m assuming you mean to compare the output of both modules. In the following snippet option.ko is the custom module and option_original.ko is the original module.

file option*
option.ko:          ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=52e2b4323b47d37d4d25fbf2b82bf5ef5e93b36c, with debug_info, not stripped
option_original.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=186151ff475b92a54add3ad5677c491130f2cefc, not stripped

The only difference is the original doesn’t have with debug_info, not sure what to interpret from this, assuming this is what you asked for …


So right now my plan is to cross compile the drivers on my PC, copy them to the jetson and install them directly, since it’s faster and I don’t need to worry about disk space. I’m still having no luck though. These are the steps I took:

I’m not sure if I needed ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- on every single make command, but I’m assuming it would do no harm where it was not needed. I’ve also changed the tegra_defconfig file, so that CONFIG_LOCALVERSION="-tegra" is set when I do make tegra_defconfig (might be redundant, but again, I don’t think it would do any harm if it is).
So what I did now (on my Ubuntu 20.04) was the following:

sudo ./source_sync.sh -k tegra-l4t-r32.6.1
cd sources/kernel/kernel-4.9
# make mrproper
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- mrproper
# make tegra_defconfig
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- tegra_defconfig
# CONFIG_LOCALVERSION="-tegra" is already set
# make modules_prepare
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- modules_prepare
# make modules
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- modules
# make Image
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- Image

Then I made the install script look like this:
install.sh

cd option
make
cd ../qmi_wwan_simcom
make

The makefiles for each module, respectively, look like this:
Makefile for option

obj-m:=option.o
optionmodule-objs:=module
KDIR:=/home/fzacarias/code/nvidia/sources/kernel/kernel-4.9
MAKE:=make
default:
	$(MAKE) ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) modules
clean:
	$(MAKE) -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) clean

Makefile for qmi_wwan_simcom

obj-m := qmi_wwan_simcom.o
# qmi_wwan_simcom-objs := qmi_wwan_simcom.o
KDIR:=/home/fzacarias/code/nvidia/sources/kernel/kernel-4.9
PWD := $(shell pwd)
OUTPUTDIR=/lib/modules/$(shell uname -r)/kernel/drivers/net/usb/
CONFIG_RETPOLINE=n
all: clean
	$(MAKE) ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- -C $(KDIR) M=$(PWD) modules

clean:
	rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions Module.* modules.order

Now on the TX2 NX, I copied the compiled modules from my PC to the respective directories on the jetson (/lib/modules/$(uname -r)/kernel/drivers/usb/serial/ and /lib/modules/$(uname -r)/kernel/drivers/net/usb/).

Then I did:

sudo depmod
sudo modprobe option # modprobe: ERROR: could not insert 'option': Exec format error
sudo modprobe qmi_wwan_simcom # modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

So, from what you can tell by looking at this, is there anything I am doing wrong here? Anything new I should try? Or any other ideas I could try to pursue?


On another note, I also noticed now that I can’t load usb_wwan (which option depends on), and that has not been modified at all in the waveshare files. Not sure what to make of this. Running file on usb_wwan also indicates it is compiled for the correct architecture; actually all modules (inside drivers/usb/serial) appear to be compiled for aarch64, but neither qmi_wwan_simcom, option nor usb_wwan (which again has not been modified) can be loaded, while the other drivers in the same directory can.
So doing something like modprobe usb_wwan would also yield Exec format error.


I also noticed now that, trying to do modprobe option with dmesg --follow opened in a parallel terminal, I see usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel). I will try to see what I can do about this.
Same for the qmi_wwan_simcom module. Dmesg shows cdc_wdm: exports duplicate symbol usb_cdc_wdm_register (owned by kernel). I understand what this means, but I’m not sure yet how to prevent it.
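
To pull just these messages out of a longer kernel log, a small filter along the following lines could help. This is a sketch only: duplicate_symbols is a made-up function name, and it assumes the message keeps the wording seen above.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "<module> <symbol>" for every
# "exports duplicate symbol" line.
# (duplicate_symbols is a hypothetical helper name.)
duplicate_symbols() {
    awk '/exports duplicate symbol/ {
        for (i = 2; i + 3 <= NF; i++) {
            if ($i == "exports" && $(i + 1) == "duplicate" && $(i + 2) == "symbol") {
                module = $(i - 1)
                sub(/:$/, "", module)   # drop the trailing colon after the module name
                print module, $(i + 3)
            }
        }
    }'
}

# Demo using the two messages quoted in this post.
printf '%s\n' \
    'usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel)' \
    'cdc_wdm: exports duplicate symbol usb_cdc_wdm_register (owned by kernel)' |
    duplicate_symbols
```

On the Jetson the input would come from dmesg instead, for example dmesg | duplicate_symbols right after a failed modprobe.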

Thank you for your help.

For reference “Image” is the uncompressed kernel image…the full, integrated content. In order to build this all of the separate subdirectories of kernel source have to have configuration. When building “Image” you are guaranteed that the “.config” file at the root of the build is propagated to every subdirectory. If you don’t build “Image”, then you have to use alternate methods to propagate the Kconfig content to subdirectories. This would normally be “modules_prepare”, and you can skip building “Image”, but unless I’ve built that configuration before I tend to build Image just once to see if it all builds (which simultaneously runs the equivalent of “modules_prepare”, but tests things which might be skipped if building only modules).

Note that in a given configuration the integrated features are what load and use dynamically loadable modules. If there is an issue with the content which loads modules, but not the modules, and if you build only modules, it is possible something is wrong with configuration and you won’t know it until you load modules, or perhaps not until a newly loaded module is actually used. Modules are just a tiny subset of the full kernel. Nothing requires building the Image if just building modules, but modules will fail if the Kconfig system is not propagated to the kernel source subdirectories before starting that build, and having the .config file in itself will not be sufficient for this.

If you are building directly on the Jetson, then yes, you can skip building Image if you’ve built modules_prepare. I strongly suggest though that if building on a Jetson with limited space you attach something like external USB storage and build from that even if you are building only modules. If you have good ethernet connection, then perhaps you could even use the ssh filesystem “sshfs”. There are a number of tutorials on this, and some mentions of it in this forum. An example URL:
https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh
(I wouldn’t bother using this at a remote Internet location, but on a local gigabit this is quite useful)

Yes, the “Linaro GCC 7.3-2018.05” looks correct, and the “file” command shows this is the correct architecture. For size reduction you can strip debug information, but it is useful sometimes to have this during installation testing. The cross tools (or base tools if using the right command file name) will have something like “<tool_chain_path>/aarch64-linux-gnu-strip --strip-unneeded <path-of-kernel-module.ko>” for stripping debug content when not needed. The “modules_install” step can also specify “INSTALL_MOD_STRIP=1” to do this.

Honestly, unless you are using the full Xavier or Orin you are probably going to cross compile a lot faster than you will native compile (and even on the full systems perhaps cross compile is still faster…it depends on the host PC).

If you are cross compiling, then you would always use “ARCH=arm64”. If not, this can cause some confusion to the kernel when loading modules and might fail thinking it is a foreign architecture. It shouldn’t so far as I know, but it does fail. Skip “ARCH” and “CROSS_COMPILE” if not cross compiling.

I too like editing tegra_defconfig to have “CONFIG_LOCALVERSION=-tegra” at times, but usually just hand edit the “.config” (I happen to know that CONFIG_LOCALVERSION has no dependencies, so it is ok to just use a text editor).

I also tend to use the download URL for the specific L4T release rather than using source_sync.sh. There are some cases where some of the content might be missing when using source_sync.sh. If it compiles, then it isn’t a problem, but if something is missing, then use the specific L4T release download. A listing of L4T releases is here:
https://developer.nvidia.com/linux-tegra

Note that if you do make the “Image”, then it should be prior to “modules”. Reversing the order defeats the purpose of testing the Image build in part because this allows skipping “modules_prepare”. So either:

  1. tegra_defconfig
  2. nconfig (if modifying)
  3. modules_prepare
  4. modules

OR:

  1. tegra_defconfig
  2. nconfig (if modifying)
  3. Image
  4. modules
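
The two sequences listed above can be sketched as a dry run that only prints the make commands in order. The function name build_targets and the TEGRA_KERNEL_OUT variable are assumptions for illustration; the optional nconfig step is left out because it is interactive, and a cross compile would add ARCH and CROSS_COMPILE to each command.

```shell
#!/bin/sh
# Print the make targets, in order, for one of the two sequences.
#   prepare: tegra_defconfig, modules_prepare, modules
#   image:   tegra_defconfig, Image, modules
# (build_targets is a hypothetical helper name.)
build_targets() {
    case $1 in
        prepare) echo "tegra_defconfig modules_prepare modules" ;;
        image)   echo "tegra_defconfig Image modules" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Dry run: show the commands without executing them.
# TEGRA_KERNEL_OUT stands in for the O=/some/where output directory.
for target in $(build_targets prepare); do
    echo "make O=\"\$TEGRA_KERNEL_OUT\" $target"
done
```

Replacing echo with an actual make invocation turns the dry run into a build, which is best done from the top of the kernel source and not as root.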

If this is cross compiled, then the Makefile for the option content is valid. However, if this is native compile, then you should remove the “ARCH” and “CROSS_COMPILE”. Make sure the “OUTPUTDIR” is the same as the “O=/some/where” in previous steps since this is where config propagates. This too should not use “ARCH” or “CROSS_COMPILE” unless it is a cross compile.

I would suggest “sudo depmod -a” and not just “sudo depmod”. I don’t use it without an option, and I don’t know for sure, but perhaps if you skip the “-a” it might need naming the specific module.

Any time you see “Exec format error” you are guaranteed that this kernel thinks the module is a foreign architecture. Regardless of what “file” says for architecture, the kernel does not agree should it be arm64/aarch64. Very likely this is from specifying ARCH and CROSS_COMPILE if you were compiling natively. All such files need to be recompiled without ARCH.

Hello.
It’s been several working days and I’ve come to a solution.

I’ve implemented the modules statically in the kernel’s source, instead of trying to install these out-of-tree modules that always yielded exec format error upon insertion (with modprobe). I understand that you said that this is always due to wrong architecture (and I’m not saying it’s not always the case, since you’re obviously much more knowledgeable than I am) but to clarify what was happening I think I need to push back a bit on that.
While using modprobe on one of the modules provided by waveshare, it resulted in exec format error (implying a foreign architecture, even though I am absolutely 100% sure it is compiled for the correct architecture, arm64, since it is cross compiled from an x86_64 machine, thus excluding the case you mention about specifying ARCH=arm64 when compiling natively). The dmesg logs showed that the module usb_wwan (which was a dependency of option) and qmi_wwan_simcom were exporting duplicate symbols owned by the kernel.
I eventually found out that the modules I was trying to install were modifications of already existing modules, which were already exporting the same symbols as the modules I was trying to load.

Keep in mind this is my own interpretation of my observations during this whole process. I am not very deeply knowledgeable about what happens under the hood, therefore I could be incorrect about my conclusions.

Say, for example, when I was running modprobe option (and I mean the custom waveshare option driver), which depended on usb_wwan, with dmesg open on a separate window, I would see usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel). To me, this error message implies that there’s a completely different problem than what I was assuming (foreign architecture module). I’m glad I found this because I spent way too much time trying to compile the drivers under different setups when all resulted in the same problem, the duplicate symbol.
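
The “owned by kernel” part of that message suggests the symbol comes from code built into the kernel image rather than from another loadable module. A quick way to check that, sketched below, is to look the module up in the modules.builtin listing under /lib/modules. The helper name is_builtin is made up, and the demo listing is fabricated for illustration rather than taken from this Jetson.

```shell
#!/bin/sh
# Report whether a module name appears in a modules.builtin listing.
# Usage: is_builtin <module-name> <path-to-modules.builtin>
# (is_builtin is a hypothetical helper name.)
is_builtin() {
    # Entries look like "kernel/drivers/usb/serial/usb_wwan.ko".
    grep -q "/$1\.ko\$" "$2"
}

# Demo against a throwaway listing; the entries are made up for illustration.
listing=$(mktemp)
printf '%s\n' \
    'kernel/drivers/usb/serial/usb_wwan.ko' \
    'kernel/drivers/usb/class/cdc-wdm.ko' > "$listing"

for name in usb_wwan option; do
    if is_builtin "$name" "$listing"; then
        echo "$name: built in"
    else
        echo "$name: not built in"
    fi
done
```

On a real system the second argument would be /lib/modules/$(uname -r)/modules.builtin. File names there can use hyphens where the module name uses underscores (cdc-wdm.ko versus cdc_wdm), so both spellings are worth trying. For what it is worth, the module loader in kernels of this era appears to return ENOEXEC for a duplicate export, which user space prints as “Exec format error”, so the two messages are likely the same failure seen from two sides.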

I figured then that the simplest solution was to just merge the changes to the driver directly into the kernel source and recompile the kernel. I’m not sure if that’s what you would call a “bad practice” as it would go against modularity, but for my context, if it works, I don’t think it would be such a problem.

Long story short, I added the changes from the option driver to the source code of option.c and the changes in qmi_wwan_simcom.c to qmi_wwan.c, and after I set everything up correctly I was able to successfully boot the TX2 with the now modified option and qmi_wwan. I am able to hotplug the 5G modem and, with their provided script, connect to the 5G network.

I will wait for your thoughts on what I said before I mark anything as a solution, just to be sure I’m not misguiding anyone, as these are simply my observations. If you feel you have something to correct me on, or anything else to add, please do.

Thank you.

This was actually a really good thing to do. Sometimes error messages are misleading. A perfect error message can sometimes require knowing about an error which occurs because we don’t know what the error is…a bit of a “Catch 22” as the phrase goes.

Normally I would actually expect to see log messages about duplicate symbols. “exec format error” in no way suggests this, but it might just be what the issue has been all along due to bad error messages. It is true that if there is a duplicate symbol, then insmod would fail, but the message should state this as the reason (versus exec format error). I think this might be why the original install instructions might provide a substitute option.ko module: A need to remove or edit symbols. I don’t know for sure, but this now seems very likely.

Editing what is actually in the source code would certainly work, although editing what is in the modules would in theory do the same thing (provided the changes needed were in code within modules). My main worry with changing the kernel itself is that if there is a normal kernel upgrade at some point, then it would be a more elaborate “fix” to rebuild everything as the kernel upgrades. The result of changing this in module format versus integrated in the kernel is the same, and the only difference is the convenience of what happens when you must repeat the process.

Be certain to save the working kernel source, binary, and modules in case you need to either reinstall the kernel after an unexpected update, or else want to recompile the changes into a new kernel release.

Additionally, when you modify the kernel source, you should also rebuild 100% of the modules and use a new/different “uname -r” (a change to “CONFIG_LOCALVERSION” to get modules to go to a new location different than the unmodified kernel would use). I don’t know if the changes you made could be in a module instead, but should it be only a module which changes, then you wouldn’t need to rebuild the other modules, nor would you need a new “uname -r” (the two go together).

Up until now I’ve not noticed the “usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel)” message. I think the exec format error has prevented error messages from going that far. It would have saved a lot of time if the message had been stating this from the start.

Incidentally, you can find the name of the package owning a file, and put that package on hold. An example, to find the owner of “/boot/Image”, you can:
dpkg -S /boot/Image

Not all kernels for Jetsons are handled by packages, at least not in earlier releases. Later releases do put the kernel in a package since it is needed for OTA updates (which were not supported in earlier releases). However, let’s pretend the package owning “/boot/Image” is named (fictitiously) “jetson-kernel”. One could put this on hold from update:
echo 'jetson-kernel hold' | sudo dpkg --set-selections

Or remove it from hold:
echo 'jetson-kernel install' | sudo dpkg --set-selections

For more details on that see:
https://askubuntu.com/questions/18654/how-to-prevent-updating-of-a-specific-package
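
To chain the two steps, the package name can be cut out of the dpkg -S output and fed into the selection line. This is a sketch only: package_from_dpkg_s is a made-up helper name, and “jetson-kernel” is the same fictitious package as above.

```shell
#!/bin/sh
# Read "dpkg -S" output on stdin ("<package>: <path>") and print only
# the package part of each line.
# (package_from_dpkg_s is a hypothetical helper name.)
package_from_dpkg_s() {
    sed 's/: .*//'
}

# Demo with a made-up line in the format dpkg -S prints.
pkg=$(echo 'jetson-kernel: /boot/Image' | package_from_dpkg_s)

# Build the selection line without applying it.
echo "$pkg hold"
```

On the Jetson this would become something like echo "$(dpkg -S /boot/Image | package_from_dpkg_s) hold" | sudo dpkg --set-selections, assuming the file is owned by exactly one package.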

From my point of view, this was simply a case of bad error messaging, misleading even. I’ve spent too much time on this and will have to move on for now, since we arrived at a solution (even if not the best, as you mentioned).

I will keep this in mind for the future, and make sure to keep the necessary backups for everything. Right now, issue is solved.

I will create a new thread if any new problem comes up.

Best regards,
Francisco Zacarias
