Basically, when the S870 is connected and the nvidia driver is set in xorg.conf, X takes something like 10 minutes to start (I went to check these forums, came back, and it was working). I was running the exact same software environment with the C870 with no issues. Once X finally starts, the S870 works as expected and all 4 GPUs are visible.
If I disconnect the cables or set xorg.conf to use the default X driver ‘nv’, X starts normally. There are no obvious error messages in the xorg log, but there are several in /var/log/messages. Unfortunately I can’t readily upload the results from the nvidia bug report, but I can print it and scan a pdf if needed.
A few other relevant pieces of info: I’m using a Supermicro X7DWA-N motherboard based on the Intel 5400 chipset. Since the S870 host interface cards are taking up both PCIe slots, I’m using a PCI-based GeForce 6200 (I think) for video. I could easily be way off, but I suspect the driver is trying to use the S870 GPUs for video, failing, and finally falling back to the correct PCI video card, which causes the 10-minute-plus delay.
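One way to test that suspicion would be to pin the X Device section to the PCI card with a BusID entry, so the driver has no reason to consider the S870 GPUs for display. A common stumbling block is that lspci prints bus addresses in hex while xorg.conf expects decimal. The helper below is only a sketch of that conversion; the function name and the example addresses are made up for illustration, and the real address of the GeForce card has to be read from lspci on the machine.

```python
def lspci_to_busid(address: str) -> str:
    """Convert an lspci address such as '0a:00.0' into an xorg.conf BusID string.

    lspci prints bus, device and function in hexadecimal, optionally with a
    leading PCI domain ('0000:0a:00.0'); xorg.conf wants 'PCI:bus:dev:func'
    in decimal.
    """
    parts = address.strip().split(":")
    if len(parts) == 3:
        # Drop the leading PCI domain, e.g. '0000'.
        parts = parts[1:]
    if len(parts) != 2 or "." not in parts[1]:
        raise ValueError(f"unrecognized lspci address: {address!r}")
    bus_hex, dev_func = parts
    dev_hex, func_hex = dev_func.split(".")
    return f"PCI:{int(bus_hex, 16)}:{int(dev_hex, 16)}:{int(func_hex, 16)}"


# Hypothetical example address, not taken from the machine in this thread:
# lspci_to_busid("0a:00.0") gives the value to put after BusID in the Device section.
```

Whether pinning the BusID actually shortens the startup delay here is untested; it only narrows down which card X is asked to drive.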
Another clue is that once I’m finally in X and try to run nvidia-smi, it prints some GPU found messages to the terminal and then hangs the system. Even ssh doesn’t respond. I haven’t waited around to see if it clears. As a workaround, I’ve thought about using the nv driver in xorg.conf, then executing the script found here
The Official NVIDIA Forums | NVIDIA
to make the S870 available for CUDA. Any other ideas for a fix/workaround?
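The linked script itself is not reproduced here. As a rough sketch of what a script of that kind generally has to do when X is not loading the nvidia driver: load the kernel module, then create the character device nodes that CUDA opens. The major number 195 and the control-device minor 255 are the values the NVIDIA driver is registered under; the function names and the gpu_count parameter are made up for illustration, and this is not a copy of the script from the forum post.

```python
import os
import stat

NVIDIA_MAJOR = 195     # character-device major number of the NVIDIA driver
NVIDIACTL_MINOR = 255  # minor number of the /dev/nvidiactl control device


def device_nodes(gpu_count, dev_dir="/dev"):
    """Return (path, major, minor) for every node needed for gpu_count GPUs."""
    nodes = [
        (os.path.join(dev_dir, f"nvidia{i}"), NVIDIA_MAJOR, i)
        for i in range(gpu_count)
    ]
    nodes.append((os.path.join(dev_dir, "nvidiactl"), NVIDIA_MAJOR, NVIDIACTL_MINOR))
    return nodes


def create_nodes(gpu_count, dev_dir="/dev"):
    """Create any missing nodes.

    Assumes root privileges and that the nvidia kernel module is already
    loaded (for example with 'modprobe nvidia').
    """
    for path, major, minor in device_nodes(gpu_count, dev_dir):
        if not os.path.exists(path):
            os.mknod(path, 0o666 | stat.S_IFCHR, os.makedev(major, minor))
            os.chmod(path, 0o666)  # mknod's mode is filtered by the umask


# Not called automatically: run create_nodes(n) as root, where n is the number
# of NVIDIA GPUs the driver manages on the machine.
```

Creating the nodes by hand only sidesteps X; if the soft lockups come from the driver initializing the GPUs, the same delay may simply show up when the module loads.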
I should have mentioned that I have flashed the BIOS with the latest version. And I’ll try to upload a pdf of the nvidia crash report tomorrow. This definitely seems like some sort of driver issue; hopefully the crash log will shed some light.
I’ve included the pdf of the log. Sorry, I know the format is a pain but hopefully it’s usable.
A few highlights that I’ve noticed:
1) On page 16 of the pdf, right column, there are 4 distinct messages that say “PCI: Failed to allocate mem resource”
2) On page 18 of the pdf there are 4 distinct instances of “BUG: Soft lockup detected on CPU#0” (or CPU#1), each followed by a call trace
I’m having trouble attaching the pdf, so I uploaded it to rapidshare (3MB file).
It took me a minute to figure out that you want me to try the beta 174.55 drivers. I have not tried them yet, but will right now. Thanks for your help, I’ll keep you posted.
Basically, I can’t take electrons out of the lab that the S870 is in. The best I can do is print the log and scan it. I tried attaching the pdf in these forums, but after several minutes I get redirected to a blank page. Hopefully the file hosting service will come through so you can see if there is anything obvious in the log.
Studying the bug report log, I noticed that the time between the four “BUG: Soft lockup detected on CPU#0” messages is almost exactly 128 seconds. While these messages are being recorded (1 for each GPU), the system is almost unresponsive.
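If each GPU really accounts for one lockup of about 128 seconds, four of them add up to roughly 8.5 minutes, which is in the same ballpark as the 10 minute X startup. To check the spacing without reading timestamps by eye, something like the sketch below could be run against /var/log/messages. It assumes classic syslog lines that begin with a timestamp of the form "Apr 12 14:03:07"; the function name is made up, and the year is a placeholder because that timestamp format carries none.

```python
from datetime import datetime

MARKER = "soft lockup detected"  # matched case-insensitively


def lockup_intervals(lines, marker=MARKER):
    """Return the seconds between consecutive log lines that contain marker."""
    stamps = []
    for line in lines:
        if marker.lower() not in line.lower():
            continue
        # Syslog timestamps have no year; prepend a placeholder leap year so
        # that every month/day combination parses.
        stamps.append(datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Usage on the real log (needs read access to the file):
#   with open("/var/log/messages") as log:
#       print(lockup_intervals(log))
```

The result is one interval per pair of neighbouring lockup messages, so four messages give three numbers. Logs that span a year boundary would need extra handling.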
Has anyone successfully installed the drivers with the S870 while using Feisty? I’m curious if there’s a known working kernel version.