DKMS fails to compile NVIDIA modules (340.32/343.13) during kernel (3.16.1) compile process on Debia

About a week ago, I needed a newer kernel than the one that ships with Debian 7, so I downloaded the newest stable one from kernel.org and gave it a try.
While compiling the kernel and using it went great, the nvidia module has been giving me a bit of trouble.

The nvidia-kernel-dkms package in the Debian 7 repos appears to be too old to work with the latest stable kernel, so before even downloading the new kernel I removed every trace of nvidia-kernel-dkms and all other nvidia packages from my system. I then downloaded the latest NVIDIA driver from nvidia.com (340.32). Installing it on the latest Debian kernel (3.2.0-4-amd64) went fine. I let it register with DKMS so that it would automatically compile for any new kernels I install.

The problem came after compiling the new kernel. Once the new kernel (3.16.1-1-amd64) is compiled, I have two ways to install it:

Manual:
	#Install kernel modules
	user@pc:/usr/src/linux-3.16.1$ sudo make modules_install

	#Manually copy kernel to /boot
	user@pc:/usr/src/linux-3.16.1$ sudo cp -v arch/x86_64/boot/bzImage /boot/vmlinuz-3.16.1-1-amd64

	#Manually create new initrd image
	user@pc:/usr/src/linux-3.16.1$ sudo mkinitramfs -o /boot/initrd.img-3.16.1-1-amd64 3.16.1-1-amd64

	#Manually update grub
	user@pc:/usr/src/linux-3.16.1$ sudo update-grub

	#Manually have dkms compile and install all modules registered with it for the new kernel
	#(in my case, NVIDIA is the only module registered with DKMS)
	user@pc:/usr/src/linux-3.16.1$ sudo dkms autoinstall -k 3.16.1-1-amd64

Automatic:
	#Install kernel modules
	user@pc:/usr/src/linux-3.16.1$ sudo make modules_install

	#This command runs all the postinst scripts under /etc/kernel/postinst.d/, which automatically copies the kernel to /boot, 
	#creates an initrd image for it, updates grub, and has DKMS try to compile and install all modules registered with it
	user@pc:/usr/src/linux-3.16.1$ sudo make install

The manual way works great. The NVIDIA module compiles and installs without a problem.
The automatic way, however, fails to build the NVIDIA module.
I can use the manual way without a problem on my system, but I wanted to find out why the automatic way doesn’t work, and bring the problem to the devs’ attention in case they weren’t aware of it.

The following all deals with the Automatic way of compiling and installing the nvidia module.

Setting the verbose variable to ‘1’ for DKMS under /etc/dkms/framework.conf, I noticed that the first problem DKMS was running into was running ‘make clean’ under /var/lib/dkms/nvidia/340.32/build/. The verbose output of the ‘make clean’ command looked something like this:

run-parts: executing /etc/kernel/postinst.d/dkms 3.16.1-1-amd64 /boot/vmlinuz-3.16.1-1-amd64
make[2]: f: Command not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
make[2]: rf: Command not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: rf: not found
make[2]: [clean] Error 127 (ignored)
make[2]: f: Command not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: f: not found
make[2]: [clean] Error 127 (ignored)
make[2]: rf: Command not found
make[2]: [clean] Error 127 (ignored)
/bin/sh: 1: rf: not found
make[2]: [clean] Error 127 (ignored)
Error! Bad return status for module build on kernel: 3.16.1-1-amd64 (x86_64)
Consult /var/lib/dkms/nvidia/343.13/build/make.log for more information.

After a lot of digging, I figured out that the problem was that the kernel makefile (/usr/src/linux-3.16.1/Makefile in my case) had a line that read:

MAKEFLAGS += -rR --no-print-directory

and the ‘-R’ flag that it was setting was making the ‘make clean’ command fail. I’m not sure what the ‘-R’ flag does, or if it’s safe to remove from the Makefile, but doing so fixes the problem, and allows ‘make clean’ to run without error.
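For reference, GNU make documents ‘-R’ as ‘--no-builtin-variables’, which removes built-in variables such as RM (normally "rm -f"). If the NVIDIA makefile's clean rule calls something like "$(RM) -f ..." (an assumption inferred from the "f: not found" and "rf: not found" errors above, not confirmed from the makefile itself), an empty RM would leave a bare "-f ...", and make would try to run "f". The sketch below uses a throwaway makefile, not the NVIDIA one, to show what ‘-R’ does to RM:

```shell
#!/bin/sh
# Throwaway makefile (hypothetical, not NVIDIA's) that prints the value of $(RM).
tmp=$(mktemp -d)
printf 'show:\n\t@echo "RM=[$(RM)]"\n' > "$tmp/Makefile"

# -s keeps make quiet so that only the echo output is captured.
rm_default=$(make -s -f "$tmp/Makefile" show)
rm_with_R=$(make -s -R -f "$tmp/Makefile" show)

echo "default: $rm_default"
echo "with -R: $rm_with_R"
rm -rf "$tmp"
```

With GNU make the first line should show "rm -f" between the brackets and the second should show nothing, which would be consistent with the clean errors in the log.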

Next, DKMS tries to run:

$ make KERNELRELEASE=3.16.1-1-amd64 module KERNEL_UNAME=3.16.1-1-amd64

under /var/lib/dkms/nvidia/340.32/build, which fails. Attached is the resulting make.log.
I’ve also tried this with the 343.13 beta driver, with a very similar result - telling dkms to compile it using “dkms autoinstall” works, while having the kbuild system do it fails. The make.log of that is also attached.
Am I doing something wrong, or is this what the NVIDIA Makefile is referring to when it says:
“The new approach currently has its own share of problems, some of which are architectural difficulties with KBUILD…”?

Note: You’ll see a lot of references to vmware in the log files. Originally, I was trying to get this to work on my home PC (with an NVIDIA GTX570 card), but I’m at work now, 500 km away from that machine, so I’ve reproduced the problem in vmware. While vmware doesn’t have an NVIDIA GPU, the downloaded nvidia driver had no problems installing on the Debian kernel (3.2.0-4-amd64). The nvidia module won’t load there, but I’m only trying to compile it, not run it, so that shouldn’t be a problem.
340.32.make.log (2.83 KB)
343.13.make.log (2.83 KB)
nvidia-bug-report.log.gz (57.1 KB)
nvidia-installer.log (3.11 KB)

I seem to have figured out the problem (but not a permanent solution, which probably requires the editing of the nvidia makefile). These are all the environment variables that were causing dkms to fail to build the nvidia module for me.
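As background for why these variables show up at all: make exports MAKEFLAGS and any command-line variables to every process its recipes start, so a hook launched from inside "make install" inherits them. A minimal sketch with a throwaway makefile (the "hook" target is hypothetical and merely stands in for the postinst script):

```shell
#!/bin/sh
# Throwaway makefile (hypothetical) whose recipe prints what a child process sees.
tmp=$(mktemp -d)
printf 'hook:\n\t@echo "MAKEFLAGS=$$MAKEFLAGS"\n\t@echo "obj=$$obj"\n' > "$tmp/Makefile"

# Pass obj= on the command line, similar to how kbuild passes it to its sub-makes.
hook_env=$(make -s -f "$tmp/Makefile" obj=arch/x86/boot hook)
echo "$hook_env"
rm -rf "$tmp"
```

The exact MAKEFLAGS string depends on the make version, but it should contain "obj=arch/x86/boot", and "obj" itself should be visible as an ordinary environment variable.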

There are basically 3 stages to dkms building the nvidia module. All 3 are done while the CWD is /var/lib/dkms/nvidia/340.32/build. Here is what each stage needs in order to complete successfully:

1. make clean; make -C uvm clean
Requires removing the "-R" flag from MAKEFLAGS.

2. make KERNELRELEASE=3.16.1-1-amd64 module KERNEL_UNAME=3.16.1-1-amd64
Requires unsetting ARCH, KBUILD_EXTMOD, and MAKEFLAGS.

3. make -C uvm module KERNEL_UNAME=3.16.1-1-amd64 KBUILD_EXTMOD=/var/lib/dkms/nvidia/340.32/build/uvm
Requires unsetting "obj".

To do all this in one go, I can just edit the /etc/dkms/framework.conf file and add these lines to it:

unset MAKEFLAGS
unset ARCH
unset KBUILD_EXTMOD
unset obj

which allows dkms to compile the nvidia module successfully.
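If you would rather script the change than edit the file by hand, here is a minimal sketch. The helper name add_unsets is made up for illustration; it appends each line only if it is not already present, and the example runs against a scratch file rather than the real framework.conf:

```shell
#!/bin/sh
# add_unsets FILE: append each 'unset VAR' line to FILE unless it is already there.
# (Hypothetical helper, not part of dkms.)
add_unsets() {
    for var in MAKEFLAGS ARCH KBUILD_EXTMOD obj; do
        # -x: match the whole line, -F: fixed string, -q: no output
        grep -qxF "unset $var" "$1" 2>/dev/null || echo "unset $var" >> "$1"
    done
}

# Try it on a scratch file first; running it twice should not duplicate lines.
scratch=$(mktemp)
add_unsets "$scratch"
add_unsets "$scratch"
cat "$scratch"
```

To apply it for real you would call add_unsets on /etc/dkms/framework.conf as root, ideally after making a backup copy.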

Before these variables are unset, these are their contents:

MAKEFLAGS=" --no-print-directory -RrI . -- obj=arch/x86/boot"
ARCH=x86
KBUILD_EXTMOD=
obj=arch/x86/boot

Note: Although KBUILD_EXTMOD is set to an empty value, it definitely MUST BE UNSET before the nvidia module will build for me. No idea why this is.
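One possible explanation, offered as an assumption rather than something verified against the kernel or NVIDIA makefiles: GNU make treats a variable that is set to an empty value as defined, so a conditional assignment such as "KBUILD_EXTMOD ?= ..." anywhere in the chain would be skipped and the variable would stay empty. The sketch below shows the difference with a throwaway makefile; the default path is a placeholder:

```shell
#!/bin/sh
# Throwaway makefile (hypothetical): give KBUILD_EXTMOD a default only if it is undefined.
tmp=$(mktemp -d)
printf 'KBUILD_EXTMOD ?= /path/to/module\nshow:\n\t@echo "KBUILD_EXTMOD=[$(KBUILD_EXTMOD)] origin=$(origin KBUILD_EXTMOD)"\n' > "$tmp/Makefile"

# Case 1: set but empty in the environment, as in the failing hook.
empty_case=$(KBUILD_EXTMOD= make -s -f "$tmp/Makefile" show)
# Case 2: genuinely unset, as after 'unset KBUILD_EXTMOD'.
unset_case=$(unset KBUILD_EXTMOD; make -s -f "$tmp/Makefile" show)

echo "empty: $empty_case"
echo "unset: $unset_case"
rm -rf "$tmp"
```

In the empty case make should report the origin as "environment" and leave the value blank; in the unset case the default from the makefile should take effect.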

To summarize:

On Debian 7, if you’re compiling a new kernel from kernel.org, and run “sudo make install” to install the newly compiled kernel, one of the things that will run is /etc/kernel/postinst.d/dkms, which will try to compile and install all modules registered with dkms for the new kernel. If NVIDIA is one of those modules, it will fail. This appears to be because the environment that is created by the new kernel’s Makefile has some variables set that mess up “make” when dkms tries to compile the nvidia module. If you tell dkms to unset these variables however, by adding the following lines to /etc/dkms/framework.conf ahead of time, it will compile and install the nvidia module without a problem:

unset MAKEFLAGS
unset ARCH
unset KBUILD_EXTMOD
unset obj

Changing /etc/dkms/framework.conf is clearly not a long-term solution though, so if the devs can look into modifying the NVIDIA Makefile to handle these environment variables automatically (which I’m assuming is what’s required), that would be great.