TESLA M2070 troubleshooting. Probably broken hardware components?

Hi Folks.
Short intro: I am the co-owner of an architectural company that also does high-end 3D work, mostly for other architects. Early this year I decided to build a TESLA workstation for GPU rendering. As this was meant to be a test rig, I did not want to spend too much money on the TESLAs themselves, so I bought a secondhand card from an Italian guy on eBay. That card is working properly. More on that later. TESLA was a must, because we need the 6GB of RAM for the high-polygon-count scenes and big frame buffers we normally deal with while rendering (trees, grass, big textures etc.).
Ok here are the specs of the beast:

main system:
-Motherboard: Asus P6T7 WS SuperComputer
-Proc: Intel Core i7 980
-PSU: Enermax Platimax - 1500 W
-Memory: 24 GB DDR3-1333
-Harddisk System: SSD 128GB
-Harddisk Data: 2TB, SATA-3

graphic system:
-1x QUADRO FX 4800 (my old one ripped from my previous WS)
-1x secondhand TESLA M2070 6GB bought from an Italian guy on eBay (800 EUR)
-1x secondhand TESLA M2070 6GB bought from a French fraudster on eBay (1300 EUR; “fraud” because I paid for TWO cards at this price and received ONE that is probably broken. omg)

os: Win7 x64

All that stuff sits in a Silverstone SST-FT02B, which I modded so there is room for 4 GPUs.

You see: there should be 3 Teslas in that rig, but there are actually 2, and one of them is causing trouble…
OK, with the Italian card I first ran into thermal issues, because the Silverstone case was not able to cool the passively cooled M2070 enough. I decided to water-cool only the GPU, since there are no full-cover blocks on the market for this kind of card. All went fine. My system was stable and we did some amazing rendering work on that one card. Having successfully tested it, it was time to get another two Teslas into the system. But I only received one from that French fraudster.
I modded that card the same way as the Italian one, with a water block on the GPU, and installed it in the system.
Then my problems started. They occur occasionally, mostly after a resolution change or material change, when the renderer has to restart the computing process. The failure mostly leads to a complete system freeze, where all I can do is push the start button to reset the system. Sometimes I can stop the render process with Task Manager; however, nvidia-smi then reports an initialization error and I have to reboot the machine.

(shortened output:)
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe -q
==============NVSMI LOG==============
Timestamp : Sat Mar 31 15:06:01 2012
Driver Version : 286.19
Attached GPUs : 3
GPU 0000:08:00.0
Product Name : Tesla M2070
Serial Number : 0000000000000 <— that’s the working card
Bus : 0x08
Gpu : 99 %
Memory : 16 %
GPU 0000:09:00.0
Product Name : Tesla M2070
Serial Number : 0323810024154 <— that’s the problem child
Bus : 0x09
Gpu : 99 %
Memory : 0 % <-------------------------- after the system becomes unstable
GPU 0000:04:00.0
Product Name : Quadro FX 4800
Bus : 0x04
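
For anyone who wants to watch for this pattern automatically, here is a minimal sketch in Python that parses output shaped like the shortened log above and lists cards that look busy while showing 0 % memory utilization. It assumes the abbreviated field names shown here (`Gpu`, `Memory`); the full `nvidia-smi -q` report has more fields and may be laid out differently depending on driver version. The function names and the 90 % threshold are made up for illustration, and the "busy GPU, idle memory" check is only the symptom seen in this log, not a general diagnostic.

```python
import re

# Abbreviated sample in the same shape as the shortened log above.
SAMPLE = """\
Attached GPUs : 3
GPU 0000:08:00.0
Product Name : Tesla M2070
Gpu : 99 %
Memory : 16 %
GPU 0000:09:00.0
Product Name : Tesla M2070
Gpu : 99 %
Memory : 0 % <--- after the system becomes unstable
GPU 0000:04:00.0
Product Name : Quadro FX 4800
"""

def parse_nvsmi(text):
    """Split the abbreviated report into one dict per GPU section."""
    gpus = []
    current = None
    for raw in text.splitlines():
        # Drop hand-written "<--- note" annotations at the end of a line.
        line = raw.split("<")[0].strip()
        header = re.match(r"GPU\s+([0-9A-Fa-f:.]+)$", line)
        if header:
            current = {"id": header.group(1)}
            gpus.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return gpus

def percent(value):
    """Turn a string like '99 %' into the integer 99."""
    return int(value.rstrip("%").strip())

def suspicious(gpus, busy=90):
    """IDs of GPUs that report high GPU load but zero memory utilization."""
    flagged = []
    for gpu in gpus:
        if "Gpu" in gpu and "Memory" in gpu:
            if percent(gpu["Gpu"]) >= busy and percent(gpu["Memory"]) == 0:
                flagged.append(gpu["id"])
    return flagged

print(suspicious(parse_nvsmi(SAMPLE)))
```

On a live system the text could be captured by running `nvidia-smi.exe -q` and passing its output to `parse_nvsmi`, provided the field names match.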

BTW both cards passed:
nbody --benchmark --n=131072 --device=N
bandwidthTest --memory=pinned --device=N

my troubleshooting steps:

  1. I tested different drivers, switched the cards from non-ECC support to ECC and back again etc.

  2. Then I thought it could be another cooling issue, since the back plate of the Tesla is a thin piece of aluminum covering the RAM on the back of the card, and it gets hot, although I can touch it for a couple of seconds without burning my fingers. Also, the card seemed to fail after a couple of minutes and not immediately. So I decided to screw some copper tubing onto both the front and back covers of the faulty card, so the RAM is also cooled by my water circuit. See attached images.
    Now the temps are as follows:
    the GPU cores are around 45-50 °C and the water 30-35 °C at 100% computing. The front and back plates of the faulty card are really cool now, thanks to the copper water pipe running over the RAM on both sides…

  3. I swapped power supply cables

  4. I ran the faulty card on its own in the same PCIe slot where the working card had been.

The result: nothing changed in the behavior of the card. The memory and GPU are at really low temperatures, so it must be something else.
Any suggestions?
Is there another tool for testing a TESLA, or just the RAM of the card? Like I said, nbody and bandwidthTest from the CUDA SDK ran without problems; however, nbody doesn’t actually use much memory…
Could it be just a capacitor?
Any suggestions highly appreciated.

And I know the new GTX 680 with 4 GB of RAM will be available soon and might be an option, but better to have 6 GB than 4…

Anyone? This card is driving me crazy :(

Is there an option to memcpy 6GB onto the card, increment every value, copy the data back off the card, and check whether the results are correct?
Do you have enough RAM? I guess you do on a clean system.
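
Roughly, the check described above could look like the following. This is only a host-side sketch in Python of the write / increment / read back / compare logic, not a working GPU test: `device_increment` is a hypothetical stand-in for the real round trip (copy to the card, a small kernel that adds 1 to every byte, copy back), and all names are invented for illustration. It also works in chunks rather than one 6GB copy, which is an assumption on my part; a real test would need to keep every chunk allocated on the card at the same time so that all of its memory is actually exercised.

```python
def make_pattern(chunk_index, size):
    """Deterministic byte pattern that differs from chunk to chunk."""
    return bytes((chunk_index + i) & 0xFF for i in range(size))

def device_increment(data):
    """Stand-in for the GPU round trip: here the +1 is done on the host."""
    return bytes((b + 1) & 0xFF for b in data)

def check_memory(total_bytes, chunk_bytes, roundtrip=device_increment):
    """Return (absolute_offset, expected, got) for every mismatching byte."""
    errors = []
    offset = 0
    chunk_index = 0
    while offset < total_bytes:
        size = min(chunk_bytes, total_bytes - offset)
        sent = make_pattern(chunk_index, size)
        received = roundtrip(sent)
        for i, (s, r) in enumerate(zip(sent, received)):
            expected = (s + 1) & 0xFF  # what a healthy card should return
            if r != expected:
                errors.append((offset + i, expected, r))
        offset += size
        chunk_index += 1
    return errors

def flaky_roundtrip(data):
    """Simulated faulty card: flips one bit in byte 5 of every chunk."""
    out = bytearray(device_increment(data))
    if len(out) > 5:
        out[5] ^= 0x04
    return bytes(out)

# A healthy round trip gives no errors; the simulated fault gives one per chunk.
print(check_memory(1024, 256))
print([off for off, _, _ in check_memory(1024, 256, flaky_roundtrip)])
```

The useful part of reporting offsets rather than a plain pass/fail is that errors clustered in one address range would point at a specific memory chip rather than at power or cooling.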

tnx for the answer.

RAM should not be the problem (24 GB). I installed the CUDA SDK; however, I am not really a programmer, apart from some HTML/Flash stuff. But yeah, isn’t there an app around that does something like that?

Oh yeah, tnx for that. Found a Windows version of it. I will do some tests as soon as I can get at the card again (have to do some customer work on the working config…).

And tnx for not putting in the line “google is your friend” hihi