Tesla device problem: is it broken or is it just the driver?


I hope this is the right forum. We have a computer with 2 Tesla cards running Ubuntu 10.04. We have a problem with one of them. The device query reports some strange numbers and then crashes:

[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla C2070"

  CUDA Driver Version / Runtime Version          4.1 / 4.0

  CUDA Capability Major/Minor version number:    2.0

  ( 0) Multiprocessors x (32) CUDA Cores/MP:     0 CUDA Cores

  Max Texture Dimension Size (x,y,z)             1D=(65535), 2D=(2048,2048), 3D=(0,512,0)

  Max Layered Texture Size (dim) x layers        1D=(1) x 6, 2D=(0,0) x 0

  Concurrent copy and execution:                 No with -725995520 copy engine(s)


  Device PCI Bus ID / PCI location ID:           -722875584 / 32669

  Compute Mode:

Segmentation fault

As you can see, it detects 2 cards, but it reports 0 multiprocessors, a texture size of 0, a negative number of copy engines, and a PCI bus ID that is all messed up. We have CUDA toolkit 4.1 and the NVIDIA driver x86_64-285.05.33.

It is also weird, because sometimes my programs run on device 0 and sometimes they do not run.

We did not seem to have this problem a few days ago, before upgrading from 4.0 to 4.1. Is it just the driver, or is the card burned out?

Try reinstalling 4.1; you are still using the 4.0 runtime:

CUDA Driver Version / Runtime Version 4.1 / 4.0

Also, be sure to recompile the examples.
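The deviceQuery header above says "CUDART static linking", so an old binary keeps its 4.0 runtime until it is rebuilt against the new toolkit. As a rough sketch, the version line can be checked with a small shell function. The name check_cuda_versions is made up for illustration, and it assumes the line has the same layout as in the output above. A driver newer than the runtime is not an error in itself, so treat a mismatch only as a hint that the binary was built with an older toolkit.

```shell
# Hypothetical helper (not part of the CUDA toolkit): reads deviceQuery
# output on stdin and compares the two numbers on the
# "CUDA Driver Version / Runtime Version" line.
check_cuda_versions() {
    # On that line the last three fields are: <driver> / <runtime>
    versions=$(awk '/Driver Version \/ Runtime Version/ {print $(NF-2), $NF; exit}')
    if [ -z "$versions" ]; then
        echo "version line not found" >&2
        return 2
    fi
    driver=${versions% *}    # text before the space
    runtime=${versions#* }   # text after the space
    if [ "$driver" = "$runtime" ]; then
        echo "match: driver $driver, runtime $runtime"
    else
        echo "mismatch: driver $driver, runtime $runtime"
        return 1
    fi
}
```

It can be fed a saved copy of the output, for example check_cuda_versions < deviceQuery.log (the file name is only an example). Piping a crashing deviceQuery straight into it may show nothing, because output still sitting in the buffer can be lost when the program segfaults.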

Thanks. Is there anything special that needs to be done to remove the previous runtime? Will the CUDA toolkit installation script uninstall the previous version?

I usually move the old version out of the way manually (mv /usr/local/cuda /usr/local/cuda_4.0) and then install the new version.
This way, I can easily use older toolkits by changing PATH and LD_LIBRARY_PATH, or by using modules.
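A minimal sketch of the PATH and LD_LIBRARY_PATH switch, assuming the toolkits sit in side-by-side directories such as /usr/local/cuda and /usr/local/cuda_4.0 as in the mv command above. The function name use_cuda and the CUDA_HOME variable are just conventions picked for this example, and lib64 assumes a 64-bit Linux install.

```shell
# Hypothetical helper: select a CUDA toolkit by its install prefix.
use_cuda() {
    cuda_home="$1"
    if [ ! -d "$cuda_home" ]; then
        echo "use_cuda: no such directory: $cuda_home" >&2
        return 1
    fi
    # CUDA_HOME is only a convenience variable; some build scripts read it.
    export CUDA_HOME="$cuda_home"
    # Put this toolkit's nvcc and libraries ahead of any other version.
    export PATH="$cuda_home/bin:$PATH"
    export LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, use_cuda /usr/local/cuda_4.0 followed by nvcc --version should then report the 4.0 compiler. Each call prepends again, so it is best run in a fresh shell rather than repeatedly in the same one.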