Error found in TX2 Boot: "nvdc: open: Permission denied" and NvRmMemInit failed

linuxdev · October 7, 2022, 3:32pm

In the above, did you “make modules_prepare” before “make modules”? This, or building the target “make Image” would be needed prior to “make modules”. Without this the module configuration would still be invalid. The “make Image” is slow and not mandatory, but I like doing this once to see if something else will go wrong; it is a nice acid test for a number of issues which you might not see if only building modules.

The “exec format error” is a very serious problem and I wonder if the replacement “option.ko” is itself what triggers this. It is quite possible, if this is binary format, that this is what is failing. What do you see from “file option.ko” if you run this against the replacement module? Most of what you are doing seems correct, and there should be no possibility (other than due to quirks not normally run into) which would cause “exec format error”.

I could be wrong, but it sounds like the cross compiler you are using (version 9.4.0) is the one which comes directly with Ubuntu, and is not the one NVIDIA provides (the newer one might not work). I think your L4T release is R32.6.1, and the URL for that content is:
https://developer.nvidia.com/embedded/linux-tegra-r3261

On that URL page, look for “GCC 7.3.1 for 64 bit BSP and Kernel”. This is where you get version 7.3.1, which is known to work. I wouldn’t think your other version would cause exec format error, but you would want to use 7.3.1 anyway (I am still wondering if the replacement “option.ko” is at fault). Once you have that installed verify it is correct via:
/where/ever/it/is/aarch64-linux-gnu-gcc --version
(and your CROSS_COMPILE variable would be set to “/where/ever/it/is/aarch64-linux-gnu-”)
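
As a rough sketch of that check (the path is the same placeholder as above, and `check_toolchain` is only a hypothetical helper name), something like the following could confirm that the prefix really points at a 7.3.1 compiler before it is exported as CROSS_COMPILE:

```shell
#!/bin/sh
# Placeholder path: replace with wherever the NVIDIA-provided toolchain was unpacked.
CROSS_COMPILE="${CROSS_COMPILE:-/where/ever/it/is/aarch64-linux-gnu-}"

# check_toolchain PREFIX (hypothetical helper)
# Succeeds only if ${PREFIX}gcc is executable and reports version 7.3.1.
check_toolchain() {
    gcc_bin="${1}gcc"
    if [ ! -x "$gcc_bin" ]; then
        echo "no compiler at $gcc_bin" >&2
        return 1
    fi
    "$gcc_bin" --version | head -n 1 | grep -q '7\.3\.1'
}

if check_toolchain "$CROSS_COMPILE"; then
    export CROSS_COMPILE
    echo "using CROSS_COMPILE=$CROSS_COMPILE"
else
    echo "toolchain check failed for $CROSS_COMPILE"
fi
```

The prefix deliberately ends in a dash, since the kernel build appends "gcc", "ld", and so on to it.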

Yes, you are correct, compile should work on a Linux PC or natively. In fact though, you shouldn’t have any “aarch64-linux-gnu-gcc” if you are on the Jetson itself; this is only used for cross compile from the PC. If performing native compile, then there is no “CROSS_COMPILE=/some/where...”.

FranciscoZacarias · October 10, 2022, 2:55pm

I didn’t know this. I’ll make sure to keep that in mind from now on. Also, what exactly does make Image ensure? As in, I assume it builds the Image file, but when you say “but I like doing this once to see if something else will go wrong”, what exactly could go wrong here that I would not see when building modules?
Just wondering if it’s really necessary because with the TX2 NX we have very tight disk space (and also it takes a long time, which I would prefer to avoid in these testing scenarios).

Since we don’t have any external storage, I’m currently limited to the 16 GB (much less in reality) of the TX2 NX, which won’t be enough to make modules and/or make Image. So I think I will try to work on my PC and cross compile what I need.

I also changed the cross compiler I was using on my PC to the one provided by NVIDIA (as you suggested) and I think it’s now set up correctly, as seen in:

./aarch64-linux-gnu-gcc --version
aarch64-linux-gnu-gcc (Linaro GCC 7.3-2018.05) 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I’m assuming you mean to compare the output of both modules. In the following snippet option.ko is the custom module and option_original.ko is the original module.

file option*
option.ko:          ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=52e2b4323b47d37d4d25fbf2b82bf5ef5e93b36c, with debug_info, not stripped
option_original.ko: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), BuildID[sha1]=186151ff475b92a54add3ad5677c491130f2cefc, not stripped

The only difference is that the original doesn’t have “with debug_info”. I’m not sure what to make of this, assuming this is what you asked for …

So right now my plan is to cross compile the drivers on my PC, copy them to the jetson and install them directly, since it’s faster and I don’t need to worry about disk space. I’m still having no luck though. These are the steps I took:

I’m not sure if I needed ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- on every single make command, but I’m assuming it would do no harm when it was not needed. I’ve also changed the tegra_defconfig file, so that CONFIG_LOCALVERSION="-tegra" is set when I do make tegra_defconfig (might be redundant, but again, I don’t think it would do any harm if it is).
So what I did now (on my Ubuntu 20.04) was the following:

sudo ./source_sync.sh -k tegra-l4t-r32.6.1
cd sources/kernel/kernel-4.9
# make mrproper
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- mrproper
# make tegra_defconfig
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- tegra_defconfig
# CONFIG_LOCALVERSION="-tegra" is already set
# make modules_prepare
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- modules_prepare
# make modules
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- modules
# make Image
sudo make ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- Image

Then I made the install script look like this:
install.sh

cd option
make
cd ../qmi_wwan_simcom
make

The makefiles for each module, respectively, look like this:
Makefile for option

obj-m:=option.o
optionmodule-objs:=module
KDIR:=/home/fzacarias/code/nvidia/sources/kernel/kernel-4.9
MAKE:=make
default:
	$(MAKE) ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) modules
clean:
	$(MAKE) -C $(KDIR) -I ./usb_wwan SUBDIRS=$(PWD) clean

Makefile for qmi_wwan_simcom

obj-m := qmi_wwan_simcom.o
# qmi_wwan_simcom-objs := qmi_wwan_simcom.o
KDIR:=/home/fzacarias/code/nvidia/sources/kernel/kernel-4.9
PWD := $(shell pwd)
OUTPUTDIR=/lib/modules/$(shell uname -r)/kernel/drivers/net/usb/
CONFIG_RETPOLINE=n
all: clean
	$(MAKE) ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=/home/fzacarias/l4t-gcc/gcc-linaro-7.3.1-2018.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- -C $(KDIR) M=$(PWD) modules

clean:
	rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions Module.* modules.order

Now on the TX2 NX, I copied the compiled modules from my PC to the respective directories on the Jetson (/lib/modules/$(uname -r)/kernel/drivers/usb/serial/ and /lib/modules/$(uname -r)/kernel/drivers/net/usb/).

Then I did:

sudo depmod
sudo modprobe option # modprobe: ERROR: could not insert 'option': Exec format error
sudo modprobe qmi_wwan_simcom # modprobe: ERROR: could not insert 'qmi_wwan_simcom': Exec format error

So, from what you can tell by looking at this, is there anything I am doing wrong here? Anything new I should try? Or any other ideas I could try to pursue?

On another note, I also noticed now that I can’t load usb_wwan (which option depends on), and that has not been modified at all in the waveshare files. Not sure what to make of this. file usb_wwan also indicates it is compiled for the correct architecture. Actually, all modules inside drivers/usb/serial appear to be compiled for aarch64, but neither qmi_wwan_simcom, option, nor usb_wwan (which again has not been modified) can be loaded, while the other drivers in the same directory can.
So doing something like modprobe usb_wwan would also yield Exec format error.

I also noticed now that, trying to do modprobe option with dmesg --follow opened in a parallel terminal, I see usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel). I will try to see what I can do about this.
Same for the qmi_wwan_simcom module. Dmesg shows cdc_wdm: exports duplicate symbol usb_cdc_wdm_register (owned by kernel). I understand what this means, but I’m not sure yet how to prevent it.

Thank you for your help.

linuxdev · October 11, 2022, 5:51pm

For reference “Image” is the uncompressed kernel image…the full, integrated content. In order to build this all of the separate subdirectories of kernel source have to have configuration. When building “Image” you are guaranteed that the “.config” file at the root of the build is propagated to every subdirectory. If you don’t build “Image”, then you have to use alternate methods to propagate the Kconfig content to subdirectories. This would normally be “modules_prepare”, and you can skip building “Image”, but unless I’ve built that configuration before I tend to build Image just once to see if it all builds (which simultaneously runs the equivalent of “modules_prepare”, but tests things which might be skipped if building only modules).

Note that in a given configuration the integrated features are what load and use dynamically loadable modules. If there is an issue with the content which loads modules, but not with the modules themselves, and if you build only modules, it is possible something is wrong with configuration and you won’t know it until you load modules, or perhaps not until a newly loaded module is actually used. Modules are just a tiny subset of the full kernel. Nothing requires building the Image if just building modules, but modules will fail if the Kconfig system is not propagated to the kernel source subdirectories before starting that build, and having the .config file in itself will not be sufficient for this.

If you are building directly on the Jetson, then yes, you can skip building Image if you’ve built modules_prepare. I strongly suggest though that if building on a Jetson with limited space you attach something like external USB storage and build from that even if you are building only modules. If you have a good ethernet connection, then perhaps you could even use the ssh filesystem “sshfs”. There are a number of tutorials on this, and some mentions of it in this forum. An example URL:
https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh
(I wouldn’t bother using this at a remote Internet location, but on a local gigabit this is quite useful)

Yes, the “Linaro GCC 7.3-2018.05” looks correct, and the “file” command shows this is the correct architecture. For size reduction you can strip debug information, but it is useful sometimes to have this during installation testing. The cross tools (or base tools if using the right command file name) will have something like “<tool_chain_path>/aarch64-linux-gnu-strip --strip-unneeded <path-of-kernel-module.ko>” for stripping debug content when not needed. The “modules_install” step can also specify “INSTALL_MOD_STRIP=1” to do this.

Honestly, unless you are using the full Xavier or Orin you are probably going to cross compile a lot faster than you will native compile (and even on the full systems perhaps cross compile is still faster…it depends on the host PC).

If you are cross compiling, then you would always use “ARCH=arm64”. If you are compiling natively and set it anyway, this can cause some confusion to the kernel when loading modules, and loading might fail with the kernel thinking the module is a foreign architecture. It shouldn’t so far as I know, but it does fail. Skip “ARCH” and “CROSS_COMPILE” if not cross compiling.

I too like editing tegra_defconfig to have “CONFIG_LOCALVERSION=-tegra” at times, but usually just hand edit the “.config” (I happen to know that CONFIG_LOCALVERSION has no dependencies, so it is ok to just use a text editor).
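
A minimal sketch of that edit done non-interactively, assuming GNU sed; `set_localversion` is a made-up helper name and not part of the kernel tree:

```shell
#!/bin/sh
# set_localversion CONFIG_FILE VALUE (hypothetical helper, not part of the kernel tree)
# Rewrites the CONFIG_LOCALVERSION line in place, or appends it if missing.
# The trailing "=" in the pattern keeps CONFIG_LOCALVERSION_AUTO untouched.
set_localversion() {
    cfg="$1"
    value="$2"
    if grep -q '^CONFIG_LOCALVERSION=' "$cfg"; then
        sed -i "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"$value\"|" "$cfg"
    else
        printf 'CONFIG_LOCALVERSION="%s"\n' "$value" >> "$cfg"
    fi
}

# Example (run from the top of the kernel source after "make tegra_defconfig"):
#   set_localversion .config -tegra
```

This only makes sense for symbols without dependencies, as is the case for CONFIG_LOCALVERSION; anything else is better changed through a config editor such as nconfig.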

I also tend to use the download URL for the specific L4T release rather than using source_sync.sh. There are some cases where some of the content might be missing when using source_sync.sh. If it compiles, then it isn’t a problem, but if something is missing, then use the specific L4T release download. A listing of L4T releases is here:
https://developer.nvidia.com/linux-tegra

Note that if you do make the “Image”, then it should be prior to “modules”. Reversing the order defeats the purpose of testing the Image build in part because this allows skipping “modules_prepare”. So either:

tegra_defconfig
nconfig (if modifying)
modules_prepare
modules
OR:
tegra_defconfig
nconfig (if modifying)
Image
modules
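
The two orderings above could be sketched as a small script for the cross compile case. The paths are placeholders, the `kmake` wrapper and `DRY_RUN` switch are illustrative assumptions, and the interactive nconfig step is left out:

```shell
#!/bin/sh
# Placeholder paths (assumptions): point these at your own kernel tree and toolchain.
KSRC="${KSRC:-/path/to/sources/kernel/kernel-4.9}"
CROSS="${CROSS:-/where/ever/it/is/aarch64-linux-gnu-}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the make commands, 0 = run them

# kmake TARGET...: one make invocation with the cross compile settings applied.
kmake() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "make -C $KSRC ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE=$CROSS $*"
    else
        make -C "$KSRC" ARCH=arm64 LOCALVERSION=-tegra CROSS_COMPILE="$CROSS" "$@"
    fi
}

# Option 1: modules only; modules_prepare propagates the configuration.
build_modules_only() {
    kmake tegra_defconfig && kmake modules_prepare && kmake modules
}

# Option 2: Image first (doubles as a sanity check), then modules.
build_image_then_modules() {
    kmake tegra_defconfig && kmake Image && kmake modules
}

build_modules_only
```

With the default DRY_RUN=1 this only prints the three make commands, which is a cheap way to review them before committing to a long build. For a native build on the Jetson, the ARCH and CROSS_COMPILE settings would be dropped from `kmake`.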

If this is cross compiled, then the Makefile for the option content is valid. However, if this is native compile, then you should remove the “ARCH” and “CROSS_COMPILE”. Make sure the “OUTPUTDIR” is the same as the “O=/some/where” in previous steps since this is where config propagates. This too should not use “ARCH” or “CROSS_COMPILE” unless it is a cross compile.

I would suggest “sudo depmod -a” and not just “sudo depmod”. I don’t use it without an option, and I don’t know for sure, but perhaps if you skip the “-a” it might need naming the specific module.

Any time you see “Exec format error” you are guaranteed that this kernel thinks the module is a foreign architecture. Regardless of what “file” says for architecture, the kernel does not agree that it is arm64/aarch64. Very likely this is from specifying ARCH and CROSS_COMPILE if you were compiling natively. All such files need to be recompiled without ARCH.
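
Since “file” only reports the ELF header, another thing that could be compared is the module's vermagic string against the running kernel's release, because a module whose vermagic does not match is also refused at load time. This is a sketch rather than a known fix for this case; the helper names are hypothetical, and the only external command relied on is `modinfo -F vermagic`:

```shell
#!/bin/sh
# vermagic_release VERMAGIC
# Prints the kernel release, which is the first space-separated field of a
# vermagic string.
vermagic_release() {
    echo "${1%% *}"
}

# module_matches_kernel VERMAGIC RELEASE
# Succeeds if the release embedded in VERMAGIC equals RELEASE (e.g. "uname -r").
module_matches_kernel() {
    [ "$(vermagic_release "$1")" = "$2" ]
}

# On the Jetson, with the real module (skipped if the file or modinfo is absent):
ko="${1:-option.ko}"
if [ -f "$ko" ] && command -v modinfo >/dev/null 2>&1; then
    magic="$(modinfo -F vermagic "$ko")"
    if module_matches_kernel "$magic" "$(uname -r)"; then
        echo "$ko: vermagic matches $(uname -r)"
    else
        echo "$ko: vermagic '$magic' does not match $(uname -r)"
    fi
fi
```

A mismatch here would usually point at LOCALVERSION or a different source release rather than at the compiler.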

FranciscoZacarias · October 27, 2022, 12:01pm

Hello.
It’s been several working days and I’ve come to a solution.

I’ve implemented the modules statically in the kernel’s source, instead of trying to install these out-of-tree modules, which always yielded exec format error upon insertion (with modprobe). I understand that you said that this is always due to wrong architecture (and I’m not saying it’s not always the case, since you’re obviously much more knowledgeable than I am) but to clarify what was happening I think I need to push back a bit on that.
While using modprobe on one of the modules provided by waveshare, it resulted in exec format error (implying a foreign architecture, even though I am absolutely 100% sure it is compiled for the correct architecture, arm64, since it is cross compiled from an x86_64 machine, thus excluding the case you mention about specifying ARCH=arm64 when compiling natively). The dmesg logs showed that the modules usb_wwan (which was a dependency of option) and qmi_wwan_simcom were exporting duplicate symbols owned by the kernel.
I eventually found out that the modules I was trying to install were modifications of already existing modules, which were already exporting the same symbols as the modules I was trying to load.

Keep in mind this is my own interpretation of my observations during this whole process. I am not very deeply knowledgeable about what happens under the hood, therefore I could be incorrect about my conclusions.

Say, for example, when I was running modprobe option (and I mean the custom waveshare option driver), which depended on usb_wwan, with dmesg open on a separate window, I would see usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel). To me, this error message implies that there’s a completely different problem than what I was assuming (foreign architecture module). I’m glad I found this because I spent way too much time trying to compile the drivers under different setups when all resulted in the same problem, the duplicate symbol.

I figured then that the simplest solution was to just merge the changes to the driver directly into the kernel source and recompile the kernel. I’m not sure if that’s what you would call a “bad practice” as it would go against modularity, but for my context, if it works, I don’t think it would be such a problem.

Long story short, I added the changes from the option driver to the source code of option.c and the changes in qmi_wwan_simcom.c to qmi_wwan.c, and after I set everything up correctly I was able to successfully boot the TX2 with the now modified option and qmi_wwan. I am able to hotplug the 5G modem and, with their provided script, connect to the 5G network.

I will wait for your thoughts on what I said before I mark anything as a solution, just to be sure I’m not misguiding anyone, as these are simply my observations. If you feel you have something to correct me on, or anything else to add, please do.

Thank you.

linuxdev · October 27, 2022, 8:52pm

This was actually a really good thing to do. Sometimes error messages are misleading. A perfect error message can sometimes require knowing about an error which occurs because we don’t know what the error is…a bit of a “Catch 22” as the phrase goes.

Normally I would actually expect to see log messages about duplicate symbols. “exec format error” in no way suggests this, but it might just be what the issue has been all along due to bad error messages. It is true that if there is a duplicate symbol, then insmod would fail, but the message should state this as the reason (versus exec format error). I think this might be why the original install instructions might provide a substitute option.ko module: A need to remove or edit symbols. I don’t know for sure, but this now seems very likely.

Editing what is actually in the source code would certainly work, although editing what is in the modules would in theory do the same thing (provided the changes needed were in code within modules). My main worry with changing the kernel itself is that if there is a normal kernel upgrade at some point, then it would be a more elaborate “fix” to rebuild everything as the kernel upgrades. The result of changing this in module format versus integrated in the kernel is the same, and the only difference is the convenience of what happens when you must repeat the process.

Be certain to save the working kernel source, binary, and modules in case you need to either reinstall the kernel after an unexpected update, or else want to recompile the changes into a new kernel release.

Additionally, when you modify the kernel source, you should also rebuild 100% of the modules and use a new/different “uname -r” (a change to “CONFIG_LOCALVERSION” to get modules to go to a new location different than the unmodified kernel would use). I don’t know if the changes you made could be in a module instead, but should it be only a module which changes, then you wouldn’t need to rebuild the other modules, nor would you need a new “uname -r” (the two go together).

Up until now I’ve not noticed the “usb_wwan: exports duplicate symbol usb_wwan_chars_in_buffer (owned by kernel)” message. I think the exec format error has prevented error messages from going that far. It would have saved a lot of time if the message had been stating this from the start.
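
The “(owned by kernel)” part of that message indicates the symbol is already exported by the kernel Image itself rather than by another loadable module, which would be the case if the stock driver is built in. One way to check this, sketched under the assumption that the installed kernel ships the usual `modules.builtin` list (`is_builtin` is a hypothetical helper):

```shell
#!/bin/sh
# is_builtin MODULE [BUILTIN_LIST]
# Succeeds if MODULE appears in the kernel's modules.builtin list, meaning the
# driver is compiled into the Image and its symbols are "owned by kernel".
# Note the list uses file names, so dmesg's "cdc_wdm" is "cdc-wdm" here.
is_builtin() {
    mod="$1"
    list="${2:-/lib/modules/$(uname -r)/modules.builtin}"
    [ -f "$list" ] && grep -q "/${mod}\.ko\$" "$list"
}

for m in usb_wwan option cdc-wdm qmi_wwan; do
    if is_builtin "$m"; then
        echo "$m: built into the running kernel"
    else
        echo "$m: not listed as built-in"
    fi
done
```

If a driver shows up as built in, a replacement module exporting the same symbols cannot be loaded alongside it; the stock driver would have to be changed to a module (or disabled) in the configuration first, or the changes merged into the kernel source as was done here.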

Incidentally, you can find the name of the package owning a file, and put that package on hold. An example, to find the owner of “/boot/Image”, you can:
dpkg -S /boot/Image

Not all kernels for Jetsons are handled by packages, at least not in earlier releases. Later releases do put the kernel in a package since it is needed for OTA updates (which were not supported in earlier releases). However, let’s pretend the package owning “/boot/Image” is named (fictitiously) “jetson-kernel”. One could put this on hold from update:
echo 'jetson-kernel hold' | sudo dpkg --set-selections

Or remove it from hold:
echo 'jetson-kernel install' | sudo dpkg --set-selections

For more details on that see:
https://askubuntu.com/questions/18654/how-to-prevent-updating-of-a-specific-package

FranciscoZacarias · November 7, 2022, 10:31am

From my point of view, this was simply a case of bad error messaging, misleading even. I’ve spent too much time on this and will have to move on for now, since we arrived at a solution (even if not the best, as you mentioned).

I will keep this in mind for the future, and make sure to keep the necessary backups for everything. Right now, the issue is solved.

I will create a new thread if any new problem comes up.

Best regards,
Francisco Zacarias

system · November 21, 2022, 10:31am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
