Compute Card Fails when switching jobs

Does anyone know of a logging method that might capture the problem I am having with my Tesla M2090 card? It loads a job from the program and completes it, but when a new job is assigned within the program, the card fails. The process just stops, and it is always when it completes the first assigned task and moves on to the next one. This time it completed 33.033%, but that changes on every attempt. Whatever is occurring, the card is removed from Device Manager, and the NVIDIA Control Panel will not load, saying there is no card to manage. This isn't isolated to Windows: I have a multi-OS boot and can load into my Ubuntu drive and run the Linux version of the software with the same results. It doesn't matter how long the job is; it will always complete the first run and fail after loading the next task. I can't tell whether it is failing while reporting the results of the completed task. I was thinking maybe a driver issue, but I cannot get any driver to recognize the card except 390.77 for Linux and 386.45 for Windows. Even a restart will not resolve the issue; the system has to be powered down completely for the device to show back up on the machine.

If anyone could suggest some logging methods or solutions, I would appreciate it. I am using this card for the programming classes in my graduate degree and would really like to avoid replacing it, and the troubleshooting itself provides valuable information. I can provide any additional information that would help.

Sounds like overheating.

Try monitoring the temperature of the card using nvidia-smi while the jobs are running.

[Apparently I am typing too slowly once again. Please excuse redundancies with post #2, which wasn’t there when I started typing.]

What kind of system is this GPU installed in? The Tesla M2090 is a passively-cooled card designed to be installed in a server where air is forced across its heat sink by the high-speed fans in the chassis. The integration into a server is supposed to be handled by qualified system integrators.

From your description, which is not entirely clear, it sounds like your M2090 may be overheating due to insufficient cooling. Alternatively, or additionally, it may be suffering from an inadequate power supply. Make sure the two 8-pin PCIe power connectors are hooked up properly and supplied by a sufficiently powerful PSU.

Thanks for your reply.
I was aware that this model card has only passive cooling. Prior to install, I replaced the heatsink with an active cooling system that includes a new heatsink, mounted fans, and backside cooling. I had considered that it might not be enough, but it fails only after the first job is completed and the second starts. I have tried tasks of different lengths with the same results, running from as little as 5 minutes to as much as 0.75 hours. During the last test, load on the GPU never went over 86% and the temperature never went higher than 81 C, which is well within the operating range of the card. Granted, I cannot get nvidia-smi to continuously report the temperatures, so I only get the last one I manually check, and since the card is removed from devices, I cannot check it after it fails. Is there an alternate temperature-logging program that can record temperatures so there is an accurate measurement of when the device fails? Is there something else I am missing? If it were failing due to an error switching jobs, wouldn't it fail at the start of the second task, as opposed to at a random completion percentage during that run?

System the card is being run on:
32 GB RAM
750 W PSU
Dual boot: Windows Server 2019 & Kubuntu 18.04 LTS
OSes installed on separate SSDs
Kubuntu driver 390.77
Windows driver 386.45
Thanks for any information and for taking the time to respond.

Based on your system specifications, and assuming that the Tesla M2090 is the only GPU in the system, the PSU's power rating seems adequate. I made a mistake before: this GPU draws 250 W at most, so it should have 6-pin plus 8-pin PCIe auxiliary power connectors.
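As a quick sanity check on the connector math, the standard PCIe power limits are 75 W from the slot, 75 W from a 6-pin connector, and 150 W from an 8-pin connector. A minimal sketch of the budget (the variable names are my own, not from any post in this thread):

```python
# Rough power-budget check for a Tesla M2090 (250 W TDP).
# Standard PCIe limits: slot 75 W, 6-pin aux 75 W, 8-pin aux 150 W.
SLOT_W = 75
SIX_PIN_W = 75
EIGHT_PIN_W = 150
TDP_W = 250

available = SLOT_W + SIX_PIN_W + EIGHT_PIN_W
headroom = available - TDP_W
print(f"available: {available} W, TDP: {TDP_W} W, headroom: {headroom} W")
assert available >= TDP_W  # slot + 6-pin + 8-pin covers this card's TDP
```

So 300 W is deliverable with 6-pin plus 8-pin, comfortably above the card's 250 W maximum, provided the PSU's 12 V rail can actually sustain it.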

There is no way for me to tell whether your home-brew cooling solution is adequate. There could be hot spots that aren't reflected in the temperature sensor readout. I don't recall the maximum operating temperature for this GPU, but 81 deg C is close to the maximum operating temperature of most modern GPUs (it is 82 deg C for my Quadro P2200, for example).

The most likely diagnosis based on the data provided is still that the device is overheating and shuts off to prevent permanent damage. nvidia-smi can be configured to report data continuously, once per specified interval. Under Windows, you can use TechPowerUp’s GPU-Z for continuous reporting.
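A small wrapper script can also append each sample to a file, so the last reading before the card drops off the bus survives on disk. A minimal sketch (the file name, interval, and query fields are my own choices; on an older driver, check `nvidia-smi --help-query-gpu` for the fields it actually supports):

```python
import csv
import io
import subprocess
import time

# nvidia-smi query fields to poll each interval.
QUERY = "timestamp,temperature.gpu,utilization.gpu,power.draw"

def parse_sample(line: str) -> dict:
    """Parse one CSV line as emitted by nvidia-smi --format=csv,noheader."""
    row = next(csv.reader(io.StringIO(line)))
    ts, temp, util, power = (field.strip() for field in row)
    return {"timestamp": ts, "temp_c": temp, "util": util, "power": power}

def log_forever(path: str = "gpu_log.csv", interval_s: int = 5) -> None:
    """Append one sample per interval; stop when the card stops responding."""
    with open(path, "a", buffering=1) as log:  # line-buffered writes
        while True:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
                capture_output=True, text=True,
            )
            if out.returncode != 0:
                # Card likely dropped off the bus -- record when and stop.
                log.write(f"nvidia-smi failed at {time.ctime()}\n")
                break
            log.write(out.stdout)
            time.sleep(interval_s)

# To start logging: log_forever("gpu_log.csv", interval_s=5)
```

Equivalently, nvidia-smi's built-in loop mode does the same from the shell: `nvidia-smi --query-gpu=timestamp,temperature.gpu --format=csv -l 5 | tee gpu.log`.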

So after some continuous testing, I figured out that the PCIe controller was failing for some reason on the slot the card was in. I changed slots, and with the adjusted cooling solution the card is processing jobs correctly. However, I was wondering if anyone could recommend a rack-mounted server model that would be a good fit for multiple cards of this type. Given how much extra space is taken up by just one of these cooling solutions, I have decided it is impractical to build a bigger compute node in this format, but I am not sure of a server board that has multiple PCIe slots for the cards in a rack-sized case. I have found full-ATX boards that can do it, but I would like to avoid building what amounts to just a big box.
Thanks to everyone taking the time to comment.

I have never encountered such a failure mode, which doesn’t mean it couldn’t happen. When PCIe slots “go bad”, it is usually due to dirty or damaged connectors, either in the slot or on the card. Theoretically it is also possible for the traces connecting the controller chip and the slot to be damaged, e.g. scratched by tool use.

Permanent physical damage can occur at the springy connector fingers in the slot as a consequence of operating heavy PCIe devices such as GPUs without proper mechanical support by means of the mounting bracket, or when inserting PCIe cards at an angle. Intermittent connector failures can occur due to vibration, e.g. on ships or near heavy machinery.

I too have never run into an issue with a PCIe slot like this before. Prior to this event, any problem I had with a card slot was all or nothing: either the hardware in the slot worked or it didn't. Then again, this is the first time I have worked with compute cards for this type of application. While the exact problem is unidentified, I have verified that the slot is causing it, as I have been running the M2090 in the alternate slot continuously for a couple of days now without a problem. I moved the card back to the initial slot and started a test; it again completed the first job without a problem, but 27.016% into the next job the card failed and dropped from the bus completely. Load on the card never went higher than 90%, and the constant running temperature was only 76 C. Returning the card to the second PCIe slot and restarting again resolves the issue, and it has been running without a problem since then. This is unlike any hardware issue I have ever run into, and it is outside my skills to trace the processes and identify where the failure is occurring.

Independent of root cause, your method of resolving this is of course one we typically recommend here: cycle GPU(s) through PCIe slots to check whether the problem is correlated with a particular card or a particular slot.