I recently bought a refurbished Tesla M2090 (money is very tight) and tried to use it in a Dell Workstation T5810
(E5-1650v3, Quadro K620 GPU, 16GB DDR4, 685Watts)
I know the specs are not ideal, but on paper it should still work somehow.
After starting the PC and installing the necessary Tesla driver [version 354.92], the system recognized the GPU and everything looked fine.
But after maybe 20 minutes, the system showed an error message window indicating that the M2090 is not a removable device and cannot be undocked.
I didn't understand what was going on and restarted the system.
At first everything was fine once again, but after a few minutes the error message showed up again.
Can someone explain what the problem might be, or maybe even how to solve it?
I would be thankful for any suggestions.
Your M2090 is overheating.
The M2090 (and many other Tesla GPUs) is designed to be installed only in servers engineered to accept such cards. In particular, the big heatsink on your M2090 has no fan. The heatsink alone is not enough to keep the M2090 temperature under control – it requires closed-loop temperature control involving forced-air cooling from a server properly designed to accept the card, with an appropriate BMC, temperature monitoring and control software running on that BMC, as well as appropriate server fans and air ducting to direct air through the M2090 heatsink.
If you search this forum you’ll find other examples of those who tried something similar and failed. Tesla GPUs that do not include the letter C (or c) are not designed to be plugged into any random computer and work correctly. In a nutshell, Tesla GPUs with an integrated fan are designed to keep themselves cool, such as C2075, K20c, K40c, etc. Any Tesla GPU without a built-in fan (usually an M or m product, such as K20m or M2075) depends on significant forced air cooling from the server to keep temperatures under control. A typical workstation will not provide anywhere near enough air movement, not to mention the fact that the airflow needs to be varied as the GPU load varies.
Unfortunately, I don’t think the temperature of the M2090 can be monitored via nvidia-smi unless it’s installed in a proper server, but you can try nvidia-smi to see if it reports the GPU temperature. If it does, you’ll see the temperature gradually rising to 90C+, at which point the GPU will “drop off the bus” and no longer be usable until you power off and let it cool down. As a further proof point, you could open your system and direct a powerful fan to blow air across the M2090 heatsink, which will probably cause it to “last” longer, but it may still overheat.
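If the driver does expose the sensor, the check above can be scripted. Here is a minimal sketch that polls nvidia-smi and parses the result; it assumes nvidia-smi is on the PATH, and (as noted above) on an M2090 outside a supported server the sensor may simply report "[N/A]", which the parser maps to None. The function names here are my own, not part of any NVIDIA tool.

```python
#!/usr/bin/env python3
"""Sketch: poll GPU temperature via nvidia-smi.

Assumes nvidia-smi is installed and on PATH; the driver may or may
not expose the temperature sensor (often not for an M2090 outside a
supported server chassis).
"""
import subprocess
import time


def read_gpu_temps(output: str) -> list:
    """Parse the output of
    nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits
    Returns one entry per GPU: an int in degrees C, or None when the
    driver reports the sensor as unavailable (e.g. '[N/A]')."""
    temps = []
    for line in output.strip().splitlines():
        line = line.strip()
        temps.append(int(line) if line.isdigit() else None)
    return temps


def poll(interval_s: float = 5.0) -> None:
    """Print the temperature of every GPU once per interval."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(read_gpu_temps(out))
        time.sleep(interval_s)
```

Running `poll()` on a machine where the sensor is exposed would show the temperature climbing under load; a steady march toward 90C+ is the overheating signature described above.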
As I recall, M2090s were hot running parts, so I am doubtful a home-grown forced-air contraption would be able to keep this GPU sufficiently cool under full load. I have seen reports by “modders” who got M2090s to work with water cooling, so you may want to research that. As txbob explained, operating an M2090 outside of a server enclosure designed for such GPUs is an unsupported configuration and entirely at the risk of the user.
Thanks for these responses.
I see that this may have caused the problem.
Just out of curiosity, how fast can the temperature of a GPU core change?
I mean, when I look at GPU-Z (which shows me a temperature of approx. 50°C when the GPU is not utilized by a program [but I'm not so sure that reading is right]), the temperature changes rather slowly as utilization increases.
It can change fast (1 degree C increase per second or faster) depending on what the GPU is doing. Or it may change more slowly if the workload is lower or the GPU is idle.
I use an easy solution for M2090 cooling and temperature monitoring. I have 12 of these cards; most are inside a server, but 2 are in the plex 375. You can pick one up on eBay for less than 100 bucks, and you can fit two M2090s in it. The fan of the plex can be controlled by the computer, and the temperatures can be monitored with HWiNFO64.