Serious driver problems with v177.1x

I am a bit frustrated right now.

After having a lot of problems with driver v177.11 on Windows XP, I set up a 64-bit Linux machine with driver v177.10 today.
And guess what, it is as bad as on the Windows machine.

Please Nvidia, I have already posted bug reports on the issue. (Pending Review)
The driver locking up the machine is reproducible, cross-platform behaviour.

(CUDA SDK 2.0b: alignedTypes.exe etc.)

I cannot get any of my simulations running on the T10P without semi-random system lockups.

Any recommendations on the most stable system configuration?

If I were you I would check if you have enough power to drive the card.
Also, I believe I read Lonni writing that OpenGL interop is a shady area, since the T10Ps have had 3D turned off on the card. Are non-OpenGL samples stable?

I even added an additional power supply last week feeding 250W exclusively to the Tesla card.

So there is enough power to drive the card.

Please try the “alignedTypes” sample from the CUDA 2.0b SDK.

This is a very simple non-OpenGL sample that consistently locks up the system.

I tortured the Tesla card with the CUDA N-Body simulation and it is rock solid.

So the hardware seems to be fine.

Personally, I have a T10P without graphical out (and an 8800GTX next to it for graphics), and only the Mandelbrot example has locked up my system; N-body and particles ran fine. I’ll try alignedTypes on Monday to see if I have the same problem.

Did you recompile with -arch sm_13? I think I did not do that.

Manfred

Is the T10P driving the display or do you have another graphics card driving the display?

If you are using only the T10P on Windows, roll back to the old driver (177.03) and see if that works.

Sumit

Hello Sumit!

I am running my machine with the T10P (no display connectors) and a Quadro FX1700.

I have also tried a Geforce 8800GT, but it makes no difference.

No luck with the 177.03 driver!

I made some progress with my simulation code by extensively using ThreadSynchronize().

For the small simulation examples the code works most of the time for the first run and maybe the second run, but in the third or fourth run it locks up the machine, blue-screens, or spontaneously reboots.

For the big simulation examples the code locks up during the first run after a couple of iterations of the algorithm.

I am getting errors like: “launch timed out and was terminated” before the machine locks up.

It seems that some bad state is accumulating in the driver and finally leads to the lockup.

Sometimes the simulation code stutters for a few iterations before it dies.

This behaviour is nearly identical under Windows and Linux.

If this problem is not confined to my card/machine you can easily reproduce this instability by running the CUDA SDK sample “alignedTypes”.
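For anyone who wants to check the "fails on the third or fourth run" pattern on their own box, here is a rough sketch of a repeat-run harness. It is only an assumption about how one might do it, not anything from the SDK: the function name `run_repeatedly`, the log file name and the timeout are all made up for illustration. The one deliberate choice is that each log entry is flushed and fsync'ed before the sample is launched, so the last entry should survive a hard lockup.

```python
import os
import subprocess
import sys
import time


def run_repeatedly(cmd, runs=5, timeout_s=60, log_path="repro.log"):
    """Run `cmd` up to `runs` times and return a list of (run_index, outcome).

    Stops at the first run that does not exit with status 0.
    """
    results = []
    with open(log_path, "a") as log:

        def record(message):
            # Flush and fsync so the entry is on disk even if the machine dies next.
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
            log.flush()
            os.fsync(log.fileno())

        for i in range(1, runs + 1):
            record(f"run {i} starting")
            try:
                proc = subprocess.run(
                    cmd, capture_output=True, text=True, timeout=timeout_s
                )
                outcome = f"exit {proc.returncode}"
            except subprocess.TimeoutExpired:
                # The child is killed by subprocess.run when the timeout expires.
                outcome = "timeout"
            record(f"run {i} finished: {outcome}")
            results.append((i, outcome))
            if outcome != "exit 0":
                break
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command comes from the command line; nothing is hard-coded here.
    for index, outcome in run_repeatedly(sys.argv[1:]):
        print(index, outcome)
```

Saved as, say, repro.py (hypothetical name), it would be started with the path to the sample binary as its argument. If the machine locks up, the last "starting" line in repro.log tells you which run was in flight.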

Manfred

Just to add some of my experiences with a T10P sample running on a headless Ubuntu Hardy box. I know Ubuntu is not supported, but thought this might be interesting for other users as well. I am running 177.10 on x86_64.

The few tests I have run seem to work okay, but the machine randomly locks up, even when not using CUDA. Typically the only trace of the error is the following line in /var/log/kern.log:

May 29 10:38:40 triplex kernel: [88072.245928] console-kit-dae[8795]: segfault at 0 rip 7f496da7def0 rsp 7fff769b1928 error 4

This seems to be very close to Bug #208605 for Ubuntu Hardy, which people suspect arises because of the latest official Nvidia drivers. (Based on the thread http://ubuntuforums.org/archive/index.php/t-770382.html.)
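If others want to check their own logs for the same signature, a small sketch like the following could pull such lines out of kern.log. The regular expression is written against the single line quoted above, so treat the exact field layout as an assumption; other kernel versions may format segfault messages differently. The names `SEGFAULT_RE` and `parse_segfault` are made up for this example.

```python
import re

# Pattern modelled on the one kern.log line quoted in this post (an assumption,
# not a general description of every kernel's segfault message format).
SEGFAULT_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] "
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def parse_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "uptime_s": float(m["uptime"]),   # seconds since boot, from the [..] stamp
        "process": m["proc"],             # process name as printed by the kernel
        "pid": int(m["pid"]),
        "address": int(m["addr"], 16),    # faulting address (0 means a null access)
        "rip": int(m["rip"], 16),
        "rsp": int(m["rsp"], 16),
        "error_code": int(m["err"]),
    }


def find_segfaults(lines):
    """Yield parsed records for every matching line in an iterable of lines."""
    for line in lines:
        record = parse_segfault(line)
        if record is not None:
            yield record
```

To scan a whole log, one would pass an open file object for /var/log/kern.log to `find_segfaults` and look at which process names show up.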

Just my $0.02