Strange freezes with Tesla C2050 - help needed!

Dear All,

I have just installed a Tesla C2050 that our research group received through the NVIDIA Professor Partnership Program, but it does not seem to work correctly: it freezes every time I start certain programs, while others run fine. I have tried to collect all the data that could be relevant; if anything is still missing, please just ask. I have also attached a photo of the computer (with the side panel open), a close-up of the GPUs, and photos of two different Blue Screens of Death, in case they help. Any help is very much appreciated!

Computer: Custom Assembled
Motherboard: ASUS P6T SE
Processor: Intel Core i7 975 Extreme (no overclocking)
Memory: 6x2 GB GEIL DDR3 (PC3-14400U)
Power supply: Chieftec 750 W
GPU: NVIDIA GTX295 1.8 GB (slot 1) + NVIDIA Tesla C2050 (slot 2)
OS: Windows Vista Business x64 SP2
Driver: NVIDIA 260.81 (Tesla version, SLI disabled)
CUDA Toolkit: 3.2
CUDA SDK: 3.2

And finally, here is what I have found so far. When running e.g. the fluid sample (DirectX), everything works fine (computing on the C2050, display on the GTX295). When running nbody, everything usually freezes after about 10 s (no cursor movement, no Ctrl+Alt+Del); in about 50% of cases I have to restart the computer, and in the other 50% a Blue Screen appears. deviceQuery works fine.
I have also tested other programs (the included DirectCompute samples). All of them run fine on the GTX295, but when I switch to the C2050, all of them freeze and the machine locks up. Another interesting detail: for the few seconds it ran, I measured around 366 GFLOPS on the C2050, while the GTX295 measured 400-460 GFLOPS (strange). I have tested the programs with ECC both on and off.

I have also tested the GPUs with SiSoft Sandra, and it never crashed (neither with CUDA, OpenCL, nor shader languages). But in some cases the performance seemed suspiciously low; the C2050 almost never ranked near the top among the compared GPUs. We also measured the temperatures with a thermal camera; the average surface temperature of the cards was 55-65 °C (usually on or near the metal plates).
I hope there is some kind of solution to this problem so we can finally start using the Tesla card! We have had it for two months now, but I'm starting to lose my trust in it, as I simply cannot use it :(.

                       Thank you for your help in advance!

                                                        Yours sincerely:
                                                                 Xoranus

P.S.: We have also tested on Windows 7 x64, with same results…

Potentially bad power supply or motherboard.

I have thought about that, but if the motherboard were faulty, why does the GTX295 work completely correctly, and why does the C2050 run some programs correctly (e.g. the fluids DirectX/OpenGL samples) while others freeze immediately?

(Just for information: the computer is one year old, and since we acquired it we have been running it non-stop at full capacity. I think such a long period of stressing and testing would have revealed any faults.)

As for the power supply, as far as I know 750 W should be more than enough (especially since during the tests only the C2050 was in use), but again the question remains: why does it happen only with a couple of programs? If the problem were that the power supply cannot deliver enough power to the card, why can it deliver enough during, e.g., the tests done by SiSoft Sandra? It should crash there as well.

And again, if the motherboard or the power supply were faulty, why does the way it freezes differ from time to time (sometimes a blue screen without any information, sometimes a blue screen with information, sometimes Windows just stops responding, sometimes it only becomes really slow)?

I have seen freezes caused by the power supply before; the machine simply shut down…

The card was installed in this computer only two weeks ago; previously the machine worked without any problems. When no program uses the card, everything still works fine (during the last two weeks the processor and memory have been under load the whole time).

P.S.: Before the card freezes, it still reports the last frames from n-body, usually at only around 360 GFLOPS, which seems especially low to me.

I too am a little worried about a 130W CPU, 289W GPU, and 238W GPU on a 750W PSU even if not all those devices should be drawing peak power at the same time. The 12V rails might be close to their limit for some reason. (A 750W power supply does not usually supply all that exclusively to the 12V rails, even though that’s where the CPU and GPU draw most of their power.)

To test this theory, can you pull the GTX 295 and just use the C2050 for compute and display? I think most of the demo apps should run fine, including n-body. If the GPUs run OK installed separately, then you might want to do the math and check cabling to see how your peak current draw is being distributed over the 12V rails of the power supply.
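
To make that rail math concrete, here is a rough back-of-the-envelope 12 V budget check in Python. The TDP figures are the ones quoted above; the per-rail amp ratings are hypothetical placeholders - you would substitute the real values printed on the power supply's label.

```python
# Rough 12 V power-budget check for the configuration described above.
# TDPs are from the thread; the rail capacities are HYPOTHETICAL examples --
# read the real per-rail amp ratings off the PSU label and substitute them.

TDP_W = {
    "Core i7 975": 130,   # CPU, fed via the 8-pin EPS connector
    "GTX 295": 289,       # slot power + two PCIe power connectors
    "Tesla C2050": 238,   # slot power + 6-pin + 8-pin PCIe connectors
}

peak_12v_draw_w = sum(TDP_W.values())  # worst case: everything at peak

# Hypothetical multi-rail PSU: four 12 V rails at 18 A each.
rail_amps = [18, 18, 18, 18]
rail_capacity_w = [12 * a for a in rail_amps]

print(f"Peak 12 V draw (sum of TDPs): {peak_12v_draw_w} W")
print(f"Total 12 V capacity:          {sum(rail_capacity_w)} W")

# The catch with multi-rail designs: the total can look fine while a single
# rail feeding, say, both PCIe connectors of one card is overloaded.
per_rail_limit_w = max(rail_capacity_w)
for name, w in TDP_W.items():
    if w > per_rail_limit_w:
        print(f"WARNING: {name} ({w} W) exceeds one {per_rail_limit_w} W rail")
```

In practice the cards also draw some power through the 3.3 V slot supply and real draw sits below TDP, so this is only an upper bound - but it shows how a multi-rail supply can be "big enough" in total while a single rail sags.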

Thanks for the reply, I will have a look at it and try it!

Although I still find it strange that n-body always freezes after around 10 seconds. Why is the power enough for the first 10 seconds, and why does it fail afterwards? I also tested it on Windows 7: there, there is no Blue Screen of Death, only the computer stops responding after 10 seconds (with the mouse still working), and after another 10 seconds the mouse stops responding too…

It’s really hard to speculate. Keep in mind that the ratings on power supplies are not rigidly defined limits. As you get close to the maximum, the power supply can respond in a variety of ways, including sagging voltages or outright shutdown. You might “just” have a software problem, but unreliable power can do weird, weird things.

Dear Xoranus,

I have a similar problem history with a Tesla C2070 card.
The power supply (Corsair HX1000W) was happily feeding an old Tesla C1060, but the new card, in the same PCI-E slot, caused hang-ups, CUDA code and X Server (openSUSE 11.2 x86_64) failures, be it alone in the machine or paired with a Quadro FX 580 for display.
Motherboard: Gigabyte GA-X58A-UD7 (rev. 1); CPU: Intel Core i7 950.
My NVIDIA dealer was puzzled, and I'm still waiting for a clarifying answer.
If I had to bet on the cause (leaving the card itself aside), I'd listen to tmurray and suspect that the motherboard has some glitches, but I still have no clue why the Tesla C1060 worked (almost) flawlessly.

Yes, that's true, but - theoretically - the supply shouldn't be anywhere near its peak capacity (the GTX295 was idle during the tests). And I still have no idea why it happens only with some of the programs - and only after exactly 10 seconds - while it does not fail during the most demanding tests (using both the GTX295 and the C2050)… Anyway, thanks for the suggestion. As this is our only possible starting point for the investigation, we have arranged with a local computer repair service to test the machine with only one card at a time, in different PCI Express slots, and also to measure the voltages and performance during the crashes next week. It might give a clue. I will keep you informed about the results if you are interested.

Please let me know if you get any useful answer from them; I haven't received any from NVIDIA yet. Personally, I would find it strange if it were the motherboard's fault. Think about it: maybe one motherboard could have a hidden fault that only showed up with the C2050, but several motherboards with similar hidden faults, all triggered only by the Tesla C2050??? I guess your computer was also usually under full load before, so that should have been a good stress test if it had any weakness :)) . And by the way, on my machine the problem always appears after 10 seconds - so the motherboard fails only after 10 seconds??? Maybe I'm mistaken, but I find these things very strange. My guess would rather be overheating, since that would explain the 10 seconds (it takes time to heat up) and also why only some programs fail (not all of them use the card at full occupancy). But to tell the truth, this is mostly guesswork, as I do not have the option of swapping every part of the computer one by one for a new one to localize the problem…

I’ll keep you informed if something useful arrives.

BTW, the Tesla C2070 had a more random failure pattern, not always exactly after 10 seconds.

The machine with old Tesla C1060 was under heavy load for weeks, 24/7, with close to 100% GPU utilization.

Do you, by any chance, have access to another machine with a decent power supply, so you could plug both cards into it and test there?

I was unable to try that, because fiddling with the University's machines could make too many people too nervous.
