CUDA and KVM - Jetpack 4.3

Hello, I’m facing an issue with my TX2 board that I would like to address.

I’m working in an environment where I need both KVM and CUDA running on the host. It turns out that after enabling KVM in the kernel, CUDA stops working. I would like to know if I am doing something wrong or if this is really a bug.

Repro steps:

First, we need to start with a fresh installation of the latest (at the time of writing) Jetpack 4.3 including CUDA.

After doing that, CUDA should work; a quick sanity check is shown below.
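
As a quick sanity check (a sketch assuming the default CUDA 10.0 samples location that Jetpack 4.3 installs; adjust the path if yours differs), you can build and run one of the bundled samples before touching the kernel:

cd /usr/local/cuda-10.0/samples/0_Simple/vectorAdd
sudo make
./vectorAdd

A healthy installation prints “Test PASSED”. With CUDA confirmed working, next we enable KVM in the kernel: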

git clone https://github.com/jetsonhacks/buildJetsonTX2Kernel.git
cd buildJetsonTX2Kernel
git checkout tags/vL4T32.3.1

We are getting vL4T32.3.1 (the latest available at the time of writing), since it’s the matching kernel version for Jetpack 4.3.

sudo ./getKernelSources.sh
sudo ./editConfig.sh

The menuconfig window will pop up; select Virtualization -> Enable KVM, then Save and Exit.

sudo ./makeKernel.sh
./copyImage.sh
sudo reboot
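
As an extra check on my side (not part of the original scripts; it only relies on /proc/config.gz being available, which it is on the stock L4T kernel), you can confirm that the rebuilt kernel actually carries the KVM options:

zcat /proc/config.gz | grep KVM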

After you reboot, CUDA will no longer work. For example, executing the vectorAdd example from CUDA samples results in the following error:

root@nvidia:/usr/local/cuda-10.0/samples/0_Simple/vectorAdd# ./vectorAdd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code no CUDA-capable device is detected)!
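
When the failure occurs, a couple of generic checks can be run as well (these are my own additions, not required for the repro; the exact device node names may differ between L4T releases):

dmesg | grep -i nvgpu
ls -l /dev/nvhost* /dev/nvmap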

KVM, however, will be enabled:

$ dmesg | grep "kvm"

kvm [1]: Hyp mode initialized successfully
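
Another quick check I add here (an extra sanity step, not part of the original repro) is that the KVM device node exists:

ls -l /dev/kvm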

Another relevant point to note is that if you edit the menuconfig again to remove KVM, CUDA will work again:

sudo ./editConfig.sh

The menuconfig window will pop up; select Virtualization -> Disable KVM, then Save and Exit.

sudo ./makeKernel.sh
./copyImage.sh
sudo reboot

Now CUDA works again but KVM is disabled (not desired).

Along the way we also hit another issue that might be related: despite having KVM enabled (with CUDA not working), if we configure the network and install qemu-kvm, libvirt0, virt-manager and bridge-utils through apt-get, the TX2 won’t boot after a shutdown. The boot messages report the following:

x_tables: disagrees about version of symbol module_layout
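
For reference, once the board can be booted again, a generic way to compare a prebuilt module against the running kernel is to check its version magic (a standard diagnostic, not something from the scripts above):

uname -r
modinfo x_tables | grep vermagic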

Any ideas?

Thank you in advance

I have not tried KVM, but if your kernel build did not correctly set “CONFIG_LOCALVERSION”, then modules would not be found. If your initial configuration did not match the current kernel before making the KVM edits, then your kernel features were probably very wrong.

On your unmodified kernel, run “uname -r”. This will be a combination of the base kernel version (e.g., “4.9.140”) and a suffix (e.g., “-tegra”). To match this suffix the “.config” file would need:
CONFIG_LOCALVERSION="-tegra"

When booted the modules are found in “/lib/modules/$(uname -r)/”. If “uname -r” changed and you did not install all modules in the new location, then all modules would fail to load and things would break.
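
A quick way to check this on the booted board (just standard commands, nothing specific to those scripts; the path below is the stock module location):

uname -r
ls /lib/modules/$(uname -r)/kernel | head

If that directory is missing or nearly empty, then the modules were not installed for the running kernel’s version string.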

I have no idea if those scripts are current for JetPack/SDKM 4.3/L4T R32.3.1. I have no idea if those scripts start with a running system’s “/proc/config.gz”. That is where you should look first.

Hello linuxdev, thank you for your answer!

Let me clarify a couple of points: these scripts work directly with L4T R32.3.1, which matches the kernel version that Jetpack 4.3 installs. In fact, this set of scripts just automates the L4T kernel compilation and installation process without modifying the sources.

I confirm that the getKernelSources.sh script https://github.com/jetsonhacks/buildJetsonTX2Kernel/blob/master/scripts/getKernelSources.sh
starts from the /proc/config.gz (line 26 of the script) of the working kernel on a freshly installed Jetpack 4.3 system.

I also confirm that “uname -r” reports exactly the same in both the working and non-working versions.

Note that the conflict only occurs when I enable KVM in the kernel configuration with these scripts (L4T R32.3.1). If I use the same scripts without enabling KVM, CUDA works, since the compiled kernel then has a configuration identical to the precompiled one from NVIDIA’s Jetpack.

In short, it behaves like a toggle: if KVM is enabled, CUDA won’t work, and if KVM is disabled, CUDA works. My guess is that there is a bug/incompatibility between KVM and the CUDA driver.

Specifically for the KVM feature, if you’ve previously started via the “/proc/config.gz”, how are you editing this? For example “make menuconfig”? I’m not sure if a script would do as expected in all cases, and I’m hoping you can make that edit through a config editor instead of through a script (you might already be doing this, but I have to verify).

Note that if you create a “.config” which was originally from an uncompressed version of “/proc/config.gz”, and then edit to have:
CONFIG_LOCALVERSION="-tegra"
…this will be a guaranteed 100% exact match to the original running kernel. The only requirement here prior to building the kernel and/or modules is to use a config editor (my favorite is "make nconfig" because it can search for symbols, but is otherwise like the “menuconfig” version of this) to make the change you want. I do not know if the script is changing something more or not.
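
A sketch of that manual flow (assuming the L4T kernel source is already unpacked in a directory of your choice, e.g. “~/kernel_src”; the exact location does not matter):

cd ~/kernel_src
zcat /proc/config.gz > .config
# In the editor: set CONFIG_LOCALVERSION to "-tegra" (under "General setup"),
# toggle only the KVM options under "Virtualization", then save.
make nconfig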

Once you boot with this new kernel, if you have saved a safe copy of the original “/proc/config.gz” (always keep a saved/safe copy of this before you do any kernel work), then you could do a diff which compares the new config.gz to the old config.gz and know with a guarantee exactly what has changed. This in turn would probably say exactly why the new kernel is failing, or at least offer clues as to whether the issue is related to the kernel or to the user space content.
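
For example (a sketch, assuming the original copy was saved as “config.gz.orig” somewhere safe before the rebuild):

diff <(zcat /path/to/config.gz.orig) <(zcat /proc/config.gz)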

Hi linuxdev,

I confirm that I follow the standard L4T R32.3.1 kernel compilation and setup procedure, based on the /proc/config.gz of a fresh Jetpack 4.3 installation on my TX2, and that I edit it so that CONFIG_LOCALVERSION="-tegra".

Below you can find the diff between the .config created when I enable KVM through make menuconfig and the original .config of the fresh installation without KVM (which is identical to the .config created when I disable KVM with make menuconfig).

diff config_kvm_disabled config_kvm_enabled
345a346
> CONFIG_PREEMPT_NOTIFIERS=y
465a467
> CONFIG_ARM64_ERRATUM_834220=y
5921a5924,5934
> CONFIG_HAVE_KVM_IRQCHIP=y
> CONFIG_HAVE_KVM_IRQFD=y
> CONFIG_HAVE_KVM_IRQ_ROUTING=y
> CONFIG_HAVE_KVM_EVENTFD=y
> CONFIG_KVM_MMIO=y
> CONFIG_HAVE_KVM_MSI=y
> CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
> CONFIG_KVM_VFIO=y
> CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL=y
> CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
> CONFIG_KVM_COMPAT=y
5923c5936,5939
< # CONFIG_KVM is not set
---
> CONFIG_KVM_ARM_VGIC_V3_ITS=y
> CONFIG_KVM=y
> CONFIG_KVM_ARM_HOST=y
> CONFIG_KVM_ARM_PMU=y

To summarise again: when KVM is enabled in the kernel, CUDA stops working, and when KVM is disabled, CUDA works as usual.

This is an interesting issue. So far as I know, CONFIG_ARM64_ERRATUM_834220 will not have any relation to this, and would only show up on a translation error (see CONFIG_ARM64_ERRATUM_834220 or kernel source “Documentation/arm64/silicon-errata.txt”). This should have no relation to CUDA operation even if certain valid errors occur; it would probably alter an error message rather than cause an error. However, to verify: when this issue occurs and you are monitoring “dmesg --follow”, do you see anything logged as an error?

Any details about CUDA not working would be very useful. Error messages, dmesg errors, and so on would help. How you are logged in (local GUI, etc.) would also help. I am thinking this may not be the KVM code itself causing failure, but perhaps something such as the “DISPLAY” environment variable changing when using KVM. In both the fail and non-fail cases, what do you see from “echo $DISPLAY”?

What do you get from both fail and non-fail cases from (note there is a space in front of “version”):
glxinfo | egrep -i '(nvidia|mesa| version|nouveau)'

I’m looking at the possibility that the issue is with what is being rendered to, rather than with the actual kernel feature change, and so I’m looking for differences between the with-KVM and without-KVM test conditions (which should mostly be unchanged regardless of whether KVM is used, although perhaps the monitor resolution would change).

Hello again linuxdev,

Following your suggestions, I collected the information below. I connect a monitor via HDMI and execute the commands locally.

Here is the output when I execute with the working kernel (the one in which KVM is not enabled):

$ echo $DISPLAY
:0

$ glxinfo | egrep -i '(nvidia|mesa| version|nouveau)'
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
client glx vendor string: NVIDIA Corporation
client glx version string: 1.4
GLX version: 1.4
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: NVIDIA Tegra X2 (nvgpu)/integrated
OpenGL core profile version string: 4.6.0 NVIDIA 32.3.1
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL version string: 4.6.0 NVIDIA 32.3.1
OpenGL shading language version string: 4.60 NVIDIA
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 32.3.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

However, when it comes to the non-working version (that is, when KVM is enabled in the build), everything related to the GPU is broken, since not even the display driver works. The screen flickers while booting, switching back to the kernel loading messages for a couple of seconds and then flickering again, in a non-terminating loop. That is probably why I get a lot of “tegradc 15210000.nvdisplay:blank - powerdown” messages in dmesg (see the attachment). Therefore I am only able to operate the TX2 through the console (by pressing F2 or through ssh).

Since I am unable to use the local display due to the flickering, I couldn’t get the echo $DISPLAY information, but it is obvious that something is going wrong. Also, the boot screen reports that the nvpmodel service fails to load.

However, I was able to collect a full dmesg log, which I am attaching to this post. I should also mention that running the vectorAdd sample (which fails as described) does not generate any additional output in dmesg.

All in all, this definitely looks like unintended behaviour.

Dmesg output: https://pastebin.com/bqs8tciQ

Edit: This issue doesn’t happen in Jetpack 4.2.0 (which has other issues regarding qemu); the KVM/CUDA incompatibility starts appearing in Jetpack 4.2.1.

It looks like the installation mostly works, but hot plug detect is not working 100% correctly on that board when the KVM is used. If the monitor is plugged directly into the HDMI port without any adapter of any kind, does the unit start working in GUI mode?

Note that since HDMI is hot plug you can boot without HDMI, and then plug in after the boot completes. Or you can leave HDMI in, and after boot unplug and replug (give it a few seconds between unplug/replug). Knowing what happens with the monitor directly connected could help…it may be as simple as being sensitive to timing changes when going through the KVM.

In all my experiments the monitor was plugged directly into the HDMI port without any kind of adapter, and that is when the flicker happens.

If I boot with the HDMI cable unplugged and wait some time before plugging it in, I can press F2 to access console mode and operate normally (the flicker seems to stop after a while, probably due to a timeout). And, as expected, glxinfo | egrep -i '(nvidia|mesa| version|nouveau)' returns “Unable to open device”.

I think NVIDIA already has enough information to reproduce the bug, so we will wait for them to address the issue.

I can confirm that this issue also happens with the latest Jetpack 4.4.

Yes, at this point it might be an actual hardware issue.

It doesn’t seem to be a hardware issue, because I got KVM and CUDA working together in Jetpack 4.2, but I need it working in the latest release because I also need Docker. By the way, I have tried all the Jetpacks from 4.2.1 onwards, and they all show the same software bug that is present in 4.3 and 4.4.