S870 installation woes

I’m having problems getting an S870 up and running. I’m running 64-bit Ubuntu 7.04 with the 171.06.01 driver. The problem is very similar to this post

http://forums.nvidia.com/index.php?showtopic=57212&hl=

and might be similar to this post

http://forums.nvidia.com/index.php?showtopic=65281&hl=s870

Basically, when the S870 is connected and the nvidia driver is set in xorg.conf, X takes something like 10 minutes to start (I went to check these forums, and when I came back it was working). I was running the exact same software environment with the C870 with no issues. Once X finally starts, the S870 works as expected and all 4 GPUs are visible.

If I disconnect the cables or set xorg.conf to use the default X driver ‘nv’, X starts normally. There are no obvious error messages in the Xorg log, but there are several in /var/log/messages. Unfortunately I can’t readily upload the results from the nvidia bug report, but I can print it and scan a pdf if needed.

A few other relevant pieces of info: I’m using a Supermicro X7DWA-N motherboard based on the Intel 5400 chipset. Since the S870 host interface cards are taking up both PCIe slots, I’m using a PCI-based GeForce 6200 (I think) for video. I could easily be way off, but I suspect the driver is trying to use the S870 GPUs for video, failing, and finally falling back to the correct PCI video card, which causes the 10-minute-plus delay.
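One way to test that guess would be to pin X to the PCI card with a BusID entry in the Device section of xorg.conf. lspci prints bus addresses in hexadecimal, while xorg.conf expects decimal values in the form PCI:bus:device:function, so the address has to be converted. A minimal sketch of that conversion follows; the function name and the example addresses are hypothetical, not taken from this system.

```python
import re

# Matches lspci-style addresses such as "0a:01.0" or "0000:0a:01.0".
# The optional leading 4-digit PCI domain is accepted but ignored here.
_LSPCI_ADDR = re.compile(
    r"(?:[0-9a-fA-F]{4}:)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)

def lspci_to_xorg_busid(addr):
    """Convert a hex lspci address to a decimal xorg.conf BusID string."""
    m = _LSPCI_ADDR.fullmatch(addr.strip())
    if m is None:
        raise ValueError("not an lspci-style address: %r" % addr)
    # lspci fields are hexadecimal; xorg.conf wants them in decimal.
    bus, device, function = (int(part, 16) for part in m.groups())
    return "PCI:%d:%d:%d" % (bus, device, function)

if __name__ == "__main__":
    # Hypothetical address, for illustration only.
    print(lspci_to_xorg_busid("0a:01.0"))
```

The resulting string would go on a BusID line in the Device section for the PCI card. Whether this actually stops the driver from probing the S870 GPUs first is untested here.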

Another clue is that once I’m finally in X and try to run nvidia-smi, it prints some “GPU found” messages to the terminal and then hangs the system. Even ssh doesn’t respond. I haven’t waited around to see if it clears. As a workaround, I’ve thought about using the nv driver in xorg.conf, then executing the script found here
http://forums.nvidia.com/index.php?showtopic=63948&hl=s870
to make the S870 available for CUDA. Any other ideas for a fix/workaround?

Without seeing an nvidia-bug-report.log which captures the problem, no one can speculate on what is happening on your system.

What you should do, however, is verify that you’re using the latest motherboard BIOS, and test with the CUDA_2.0 driver.

I should have mentioned that I have already flashed the BIOS with the latest version. I’ll try to upload a pdf of the nvidia bug report tomorrow. This definitely seems like some sort of driver issue; hopefully the log will shed some light.

I’ve included the pdf of the log. Sorry, I know the format is a pain but hopefully it’s usable.

A few highlights that I’ve noticed:

  1. On page 16 of the pdf, right column, there are 4 distinct messages that say “PCI: Failed to allocate mem resource”.
  2. On page 18 of the pdf there are 4 distinct instances of “BUG: Soft lockup detected on CPU#0” (or CPU#1), each followed by a call trace.
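For anyone who has the log in text form, the two kinds of messages listed above can be counted with a short script instead of by eye. This is only a sketch: the sample lines are made up for illustration, and the exact wording and capitalisation of the real kernel messages may differ, which is why the matching is case-insensitive.

```python
import re
from collections import Counter

# Patterns based on the messages quoted in the post.
PATTERNS = {
    "pci_alloc_failure": re.compile(
        r"PCI: Failed to allocate mem resource", re.IGNORECASE
    ),
    "soft_lockup": re.compile(
        r"BUG: soft lockup detected on CPU#(\d+)", re.IGNORECASE
    ),
}

def summarize_log(text):
    """Return (message counts, list of CPU numbers named in soft lockups)."""
    counts = Counter()
    cpus = []
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "soft_lockup":
                    cpus.append(int(match.group(1)))
    return counts, cpus

# Hypothetical sample lines, NOT copied from the actual bug report.
SAMPLE_LOG = """\
kernel: PCI: Failed to allocate mem resource #1
kernel: BUG: soft lockup detected on CPU#0!
kernel: some unrelated message
kernel: BUG: Soft lockup detected on CPU#1!
"""

if __name__ == "__main__":
    counts, cpus = summarize_log(SAMPLE_LOG)
    print(dict(counts), cpus)
```

On the real system the input would be the text of /var/log/messages or of nvidia-bug-report.log read from disk.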

I’m having trouble attaching the pdf, so I uploaded it to RapidShare (3MB file)

http://rapidshare.com/files/118297868/2811_001.pdf.html

Unfortunately, RapidShare claims that I’ve reached the download limit for free users, even though I’ve not downloaded anything.

My best guess is that the soft lockup on the CPU is where the problem lies.

Have you tested with the CUDA_2.0 beta driver?

I tried a different file hosting service. Maybe this one will work:
http://www.filecrunch.com/fileDownload.php…a&fileId=151673

It took me a minute to figure out that you want me to try the beta 174.55 drivers. I have not tried them yet, but will right now. Thanks for your help, I’ll keep you posted.

I tried the beta 174.55 drivers, but no luck: same behavior as before.

Why can’t you attach the original nvidia-bug-report.log here?

Basically, I can’t take electrons out of the lab that the S870 is in. The best I can do is print the log and scan it. I tried attaching the pdf in these forums, but after several minutes I get redirected to a blank page. Hopefully the file hosting service will come through so you can see if there is anything obvious in the log.

Try zipping the pdf. The forums are a little picky about what file types can be attached.

Still no luck with a zip file. Maybe the pdf is too large? Hopefully someone in the know can decipher the pdf I uploaded.

Studying the bug report log, I noticed that the time between the four “BUG: Soft lockup detected on CPU#0” messages is almost exactly 128 seconds. While these messages are being recorded (one for each GPU), the system is almost unresponsive.
Has anyone successfully installed the drivers with the S870 while using Feisty? I’m curious if there’s a known working kernel version.
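The spacing between the soft lockup messages mentioned above can be checked mechanically if the log lines carry kernel timestamps. The sketch below assumes the bracketed seconds-since-boot format (for example "[  100.000000]") directly before the message; that format, the function name, and the sample values are assumptions for illustration, not data from the actual log.

```python
import re

# Assumes a kernel timestamp in square brackets right before the message.
LOCKUP_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+BUG: soft lockup detected", re.IGNORECASE
)

def lockup_intervals(text):
    """Return the gaps in seconds between consecutive soft lockup messages."""
    times = [float(m.group(1)) for m in LOCKUP_RE.finditer(text)]
    # Pair each timestamp with the next one and take the difference.
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]

# Hypothetical sample, NOT taken from the real bug report.
SAMPLE = """\
kernel: [  100.000000] BUG: soft lockup detected on CPU#0!
kernel: [  228.000000] BUG: soft lockup detected on CPU#1!
kernel: [  356.500000] BUG: soft lockup detected on CPU#0!
"""

if __name__ == "__main__":
    print(lockup_intervals(SAMPLE))
```

If the real log only has syslog wall-clock timestamps, the regular expression would need to be adapted to parse those instead.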