2 Tesla C1060s with a legacy GeForce FX 5200 card: Need help editing the xorg.conf file for multiple

This isn’t an xorg.conf problem. It’s an issue with your system, and it looks like an SBIOS bug. Verifying that you have the latest SBIOS would be a good idea.

I should also note that 180.06 is no longer supported, and Ubuntu 8.10 isn’t supported at all right now with CUDA. Testing with the latest released driver would be a good idea.

Do you mean system BIOS? I have the latest system BIOS, but it could still be that. It’s a finicky X58-based motherboard from MSI that was released too soon. They should have tested it with standard PCI graphics cards. I had earlier problems with RAM and clock speed.

Currently, I’m trying to start Ubuntu without starting X (I created a custom runlevel 3 with gdm disabled), and I’ll modprobe the nvidia module to get CUDA working on the cards. I’ll post on how that goes, and hopefully we can close this thread.

Yes, SBIOS = system BIOS. I’d be pleasantly surprised if your workaround for not starting X helped. Your motherboard doesn’t seem capable of accessing all the GPUs correctly from a low level.

That’s a scary thought. But, I have gotten this far since yesterday:

Installed the latest 180.22 drivers for CUDA 2.1, plus the toolkit and SDK; ran “make” on the SDK to get the binaries

Disabled gdm in run level 3 (using sysv-rc-conf) and made it the default run level by editing /etc/inittab

Replaced the existing xorg.conf with the custom-made version attached

Edited the /etc/rc.local script to call the attached cuda.sh script (that I found online) to “modprobe nvidia” and to add the /dev/nvidia* entries

On trying to run any CUDA app, it craps out and shows a version conflict between the kernel module and the driver, which I was hoping someone might know how to fix.

[codebox]cyriac@gpu2:~$ ./NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery
Error: API mismatch: the NVIDIA kernel module has version 96.43.05,
but this NVIDIA driver component has version 180.22. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
cudaSafeCall() Runtime API error in file <deviceQuery.cu>, line 59 : initialization error.
cyriac@gpu2:~$[/codebox]
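
For anyone hitting the same API mismatch: the message means the loaded kernel module and the user-space libraries come from different driver releases. Here is a rough sketch for checking which kernel module version is actually loaded. It assumes the driver exposes `/proc/driver/nvidia/version` with an "NVRM version" line; the helper names `nvidia_module_version` and `check_versions` are made up for illustration, and 180.22 is simply the version from this thread.

```shell
#!/bin/sh
# Hypothetical helper: read text in the format of /proc/driver/nvidia/version
# on stdin and print the first dotted version number found on the NVRM line.
nvidia_module_version() {
    grep 'NVRM version' | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Hypothetical helper: compare the expected user-space version ($1)
# with the loaded kernel module version ($2).
check_versions() {
    expected=$1
    loaded=$2
    if [ "$loaded" = "$expected" ]; then
        echo "OK: kernel module $loaded matches"
    else
        echo "MISMATCH: kernel module is $loaded, expected $expected"
        return 1
    fi
}

# Usage on a machine with the nvidia module loaded:
#   loaded=$(nvidia_module_version < /proc/driver/nvidia/version)
#   check_versions 180.22 "$loaded"
```

If this reports a mismatch, an older module is still being loaded from somewhere, no matter which installer was run last.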

It also seems odd that there is only one nvidia entry in /dev, when in fact, lspci clearly shows the two Tesla C1060 cards. So shouldn’t there be an “nvidia1” /dev entry too?

[codebox]cyriac@gpu2:~$ ls /dev/nvidia*
/dev/nvidia0 /dev/nvidiactl
cyriac@gpu2:~$ lspci
02:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
03:00.0 3D controller: nVidia Corporation Unknown device 05e7 (rev a1)
0a:00.0 VGA compatible controller: ATI Technologies Inc RV 610LE PCI [Radeon HD 2400]
0a:00.1 Audio device: ATI Technologies Inc RV610 audio device [Radeon HD 2400 PRO][/codebox]
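
To make the comparison between lspci and /dev explicit, here is a small sketch that counts the NVIDIA controllers in lspci-style output and lists the device nodes one would expect for them. The function names `count_nvidia_gpus` and `expected_nodes` are hypothetical, and the one-node-per-GPU naming (`/dev/nvidia0`, `/dev/nvidia1`, ... plus `/dev/nvidiactl`) is taken from what this thread shows rather than from any official reference.

```shell
#!/bin/sh
# Hypothetical helper: count NVIDIA GPUs in lspci-style text read from stdin.
# Both "3D controller" (Tesla) and "VGA compatible controller" entries count.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -c -E '3D controller|VGA compatible controller'
}

# Hypothetical helper: print the device nodes expected for $1 GPUs.
expected_nodes() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "/dev/nvidia$i"
        i=$((i + 1))
    done
    echo "/dev/nvidiactl"
}

# Usage on a real machine:
#   n=$(lspci | count_nvidia_gpus)
#   for node in $(expected_nodes "$n"); do
#       [ -e "$node" ] || echo "missing: $node"
#   done
```

With the lspci output above, the count is 2, so the expected nodes are /dev/nvidia0, /dev/nvidia1 and /dev/nvidiactl, and /dev/nvidia1 is the one that is missing.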

At this point, I dunno how to proceed. If you are familiar with the problems above, please do let me know.

Edit: Also note that I’ve switched back to the Radeon HD 2400. It doesn’t appear to trigger the SBIOS issues that netllama mentioned, which I ran into when I used a GeForce 6200.

Thanks,

Cyriac
cuda.sh.txt (1.67 KB)
xorg.conf.txt (2.09 KB)

Ubuntu ships 96.43.05. It has a ‘feature’ that reinstalls that driver and reconfigures X upon rebooting.

So how do I get around this ‘feature’? Does it involve recompiling the kernel?

Edit: I’ll try and follow these instructions and post on how that goes.

Ok. I managed to replace the driver modules shipped with Ubuntu with the current modules. But now I get this error:

[codebox]cyriac@gpu2:~$ ./NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery
NVIDIA: could not open the device file /dev/nvidia1 (No such file or directory).
cudaSafeCall() Runtime API error in file <deviceQuery.cu>, line 59 : initialization error.
cyriac@gpu2:~$
[/codebox]

Does anyone know why the following script that is called from /etc/rc.local does not create a /dev/nvidia1 entry for the second Tesla?

[codebox]
modprobe nvidia
# Count the number of NVIDIA controllers found.
N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
# Make /dev entries for each nvidia card in the system
N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
   mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255
[/codebox]

Was /dev/nvidia1 created? If not, then you need to create it.

Thanks! I needed to debug my mod-probing script. lspci was at /usr/bin/lspci and not at /sbin/lspci. So it now adds both /dev entries correctly and I have CUDA apps running on the two Teslas (without starting X), while the Radeon 2400 is used for display. Surprisingly, I was able to ‘startx’ which properly loaded the ‘radeon’ drivers for the HD2400 and the ‘nvidia’ drivers for the Teslas. So awesome! :D
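
In case it helps someone else: a hard-coded path that doesn’t exist doesn’t stop a script like this. The pipeline just counts zero controllers, so the loop never creates any /dev/nvidiaN entries. Below is a sketch of one way to avoid that by looking lspci up on the PATH and failing loudly if it is missing. The helper name `find_tool` and the extra directories it searches are my own choices, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: print the full path of tool $1, searching the current
# PATH plus the usual sbin/bin directories; returns non-zero if it is not found.
find_tool() {
    (
        PATH="$PATH:/sbin:/usr/sbin:/usr/bin:/bin"
        command -v "$1"
    )
}

# Usage inside the rc.local helper (needs root and the nvidia module loaded):
#   LSPCI=$(find_tool lspci) || { echo "lspci not found" >&2; exit 1; }
#   N=$("$LSPCI" | grep -i nvidia | grep -c -E '3D controller|VGA compatible controller')
#   i=0
#   while [ "$i" -lt "$N" ]; do
#       mknod -m 666 /dev/nvidia$i c 195 $i
#       i=$((i + 1))
#   done
#   mknod -m 666 /dev/nvidiactl c 195 255
```

The lookup runs in a subshell so the widened PATH doesn’t leak into the rest of the script.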

Maybe next I’ll try to load the fglrx drivers for the HD2400 in X. It might even be possible to run GL apps on the Teslas and extract and display images from their frame buffers. Yikes! But that’s all for another thread another time.

Thanks for all your help.