Problem: GPGPU on Xen kernels (NVIDIA drivers do not seem to work on Xen kernels)

Hi guys,

I spent the whole week trying to set up an Ubuntu 10.04 64-bit machine for Xen virtualization (Xen 4.1), to run GPGPU tests on several virtual machines using Xen’s GPU passthrough capability. My problem is that I just can’t get the NVIDIA development driver to work on the Xen kernel; it works perfectly fine with the standard kernel.

For compiling the kernel, I followed the instructions from http://www.zeroaccess.org/2011/04/xen-4-1-on-ubuntu-10-04-64bit/ but built the kernel the Debian way (make-kpkg) to get a nice .deb package. It boots fine, and I modified GRUB to set the kernel options, etc. All the Xen setup seems to be working as it should.

I can’t get the NVIDIA driver working on the host (Dom0) though. I used the driver with CUDA 3.2, and also the 4.0 RC2 driver (for a GTX 590 card). When the system tries to start X11, the screen turns blank and the system gets very slow. Booting the system into text mode works fine. I can load the NVIDIA driver manually (modprobe nvidia) and create the device nodes in /dev using mknod, so I have /dev/nvidia0, /dev/nvidia1 and /dev/nvidiactl, with major number 195 and minors 0, 1 and 255, respectively. When I try to build anything using OpenCL, it just reports that no platforms have been found. With CUDA, I get the error: “cudaSafeCall() Runtime API error : invalid device ordinal.” Both work completely fine when I boot the system into a standard kernel (linux-image-generic, the default with Ubuntu 10.04).
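For reference, the device-node layout described above can be written down as a small Python sketch that only prints the mknod commands rather than running them. The function name and the 666 permission mode are assumptions for illustration; the major number 195 and the minors 0, 1 and 255 are the ones quoted in the post.

```python
# Sketch: build the mknod commands for the NVIDIA device nodes.
# Nothing is executed here; the commands are only returned as strings.

NVIDIA_MAJOR = 195      # character-device major number quoted in the post
NVIDIACTL_MINOR = 255   # minor number of the /dev/nvidiactl control node


def nvidia_mknod_commands(num_gpus, mode="666"):
    """Return one mknod command per GPU node plus one for nvidiactl.

    `mode` is an assumed permission setting, not something from the post.
    """
    cmds = [
        f"mknod -m {mode} /dev/nvidia{i} c {NVIDIA_MAJOR} {i}"
        for i in range(num_gpus)
    ]
    cmds.append(f"mknod -m {mode} /dev/nvidiactl c {NVIDIA_MAJOR} {NVIDIACTL_MINOR}")
    return cmds


if __name__ == "__main__":
    # The GTX 590 shows up as two devices, hence nvidia0 and nvidia1.
    for cmd in nvidia_mknod_commands(2):
        print(cmd)
```

The printed commands would have to be run as root. Creating the nodes only makes the driver reachable from user space, so it would not by itself fix an adapter that fails to initialize.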

The X11 log just says that it failed to load the NVIDIA module. Syslog gives messages like “NVRM: RmInitAdapter failed!”.

I tried various suggestions for installing the driver found on the web (e.g., http://wiki.xensource.com/xenwiki/NvidiaGPU?highlight=(nvidia) ) but with no success.

Did any of you get NVIDIA and Xen to work together? How? Any help is appreciated!

The NVIDIA drivers do not work with an Ubuntu Dom0, at least not with any of the kernels and Xorg versions that come with Ubuntu.
NVIDIA drivers do work with a Dom0 based on openSUSE 11.3 and 11.4, which I am currently using. You could also take the master kernel archive from openSUSE, build that on Ubuntu and possibly get it to work on your system. But if you want the easiest route, I’d go with openSUSE. Xen just installs. To install the NVIDIA driver while running under Xen, all you have to do is “export IGNORE_XEN_PRESENCE=1” before you run the .sh file.
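If you would rather script the install than type the export by hand, here is a minimal Python sketch of the same trick. Only the variable name IGNORE_XEN_PRESENCE comes from the advice above; the helper name and the installer path in the comment are placeholders.

```python
import os


def installer_env(base_env=None):
    """Return a copy of the environment with IGNORE_XEN_PRESENCE set to 1.

    The caller's mapping is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["IGNORE_XEN_PRESENCE"] = "1"
    return env


if __name__ == "__main__":
    print(installer_env({"PATH": "/usr/bin"}))
    # Hypothetical usage, with a placeholder installer path:
    #   import subprocess
    #   subprocess.run(["sh", "/path/to/nvidia-installer-file"],
    #                  env=installer_env(), check=True)
```

Passing the variable through `env=` keeps it limited to the installer process instead of changing the whole shell session.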

Thanks for the quick reply. I did as you said and switched to openSUSE 11.4 (64-bit). I got the NVIDIA drivers working on the native (non-Xen) kernel, and all SDK examples work with no problem. I did the ignore-Xen trick and installed them on the Xen kernel as well. The X server works fine with this driver on the Xen Dom0. But when I run any of the CUDA SDK examples, I get an error telling me that all CUDA devices are busy or unavailable. I also tried this in text mode, to be sure that nothing in the X server causes it, but with the same results. The OpenCL examples just return an out_of_resources error. They all detect the GTX 590 graphics card fine, however. Do you have any ideas what might be wrong?

I am using the 270.41.6 driver now, and installed it under Xen as described here: http://old-en.opensuse.org/Talk:Use_Nvidia_driver_with_Xen (last post at the bottom). I also installed the NVIDIA driver for the native kernel from openSUSE’s community repository, so that all the libraries and header files are where they should be. So I just placed a manually built nvidia.ko into /lib/modules/xxx-xen; the user-space part is already there (versions match).

Currently, I’m also suffering from a similar problem. My environment is CentOS 5.5.

I also got the kernel warning messages below.

May 17 00:16:21 localhost kernel: NVRM: bad caching on address 0xffff8805b8aa5000: actual 0x77 != expected 0x73

May 17 00:16:21 localhost kernel: NVRM: please see the README section on Cache Aliasing for more information

May 17 00:16:21 localhost kernel: NVRM: bad caching on address 0xffff8805b8aa6000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805ba686000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805bb754000: actual 0x67 != expected 0x63

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805bbad3000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805bada7000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805ba806000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805ba7e0000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805bb757000: actual 0x77 != expected 0x73

May 17 00:16:22 localhost kernel: NVRM: bad caching on address 0xffff8805ba649000: actual 0x77 != expected 0x73

Any suggestion?
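To see whether the NVRM warnings quoted above all follow the same pattern, they can be tallied with a short Python sketch. The regular expression is written against the lines as pasted here, and the function and variable names are made up for illustration. The script only reports which bits differ between the actual and expected values and does not try to interpret them.

```python
import re
from collections import Counter

# Matches the NVRM "bad caching" warnings as they appear in the post.
PATTERN = re.compile(
    r"NVRM: bad caching on address (0x[0-9a-fA-F]+): "
    r"actual (0x[0-9a-fA-F]+) != expected (0x[0-9a-fA-F]+)"
)


def summarize_bad_caching(lines):
    """Count (actual, expected, differing_bits) triples across log lines."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not a "bad caching" warning
        actual = int(m.group(2), 16)
        expected = int(m.group(3), 16)
        counts[(actual, expected, actual ^ expected)] += 1
    return counts


# Two of the lines from the post, used as sample input.
SAMPLE = [
    "May 17 00:16:21 localhost kernel: NVRM: bad caching on address "
    "0xffff8805b8aa5000: actual 0x77 != expected 0x73",
    "May 17 00:16:22 localhost kernel: NVRM: bad caching on address "
    "0xffff8805bb754000: actual 0x67 != expected 0x63",
]

if __name__ == "__main__":
    for (actual, expected, diff), n in summarize_bad_caching(SAMPLE).items():
        print(f"{n}x actual={actual:#04x} expected={expected:#04x} "
              f"differing bits={diff:#04x}")
```

For both value pairs in the log (0x77 vs 0x73 and 0x67 vs 0x63), the actual and expected values differ only in the 0x04 bit, so every warning appears to report the same mismatch.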