Problem with M2090

Envidoc · June 9, 2016, 12:13pm

Hi everyone,

I recently bought a refurbished Tesla M2090 (money is very tight) and tried to use it in a Dell Workstation T5810:
http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell_Precision_Tower_5810_Spec_Sheet.pdf
(E5-1650 v3, Quadro K620 GPU, 16 GB DDR4, 685 W power supply)

I know the specs are far from ideal, but given what the machine has, it should work somehow.

After starting the PC and installing the necessary Tesla driver [version 354.92], the system recognized the GPU and everything looked fine.
But after maybe 20 minutes the system showed an error message window indicating that the M2090 is not a removable device and cannot be undocked.
http://s33.postimg.org/gvwx6kyhb/Nvidia_Error.png
I didn’t understand what was going on and restarted the system.
At first everything was fine again, but after a few minutes the error message showed up again.

Can someone explain what the problem might be, or maybe even how to solve it?

I would be thankful for any suggestions.

Best regards,
Envidoc

Robert_Crovella · June 10, 2016, 1:48am

Your M2090 is overheating.

M2090 (and many other Tesla GPUs) are designed to be installed only in servers that are designed to accept those cards. In particular, the big heatsink on your M2090 has no fan. The heatsink alone is not enough to keep the M2090 temperature under control – it requires a closed loop temperature control involving forced air cooling from a server that is properly designed to accept the card, with appropriate BMC, temperature monitoring and control software in the server BMC, as well as appropriate server fans and air ducting to direct air through the M2090 heatsink.

If you search this forum you’ll find other examples of those who tried something similar and failed. Tesla GPUs that do not include the letter C (or c) are not designed to be plugged into any random computer and work correctly. In a nutshell, Tesla GPUs with an integrated fan are designed to keep themselves cool, such as C2075, K20c, K40c, etc. Any Tesla GPU without a built-in fan (usually an M or m product, such as K20m or M2075) depends on significant forced air cooling from the server to keep temperatures under control. A typical workstation will not provide anywhere near enough air movement, not to mention the fact that the airflow needs to be varied as the GPU load varies.

Unfortunately, I don’t think the temperature of the M2090 can be monitored via nvidia-smi unless it’s installed in a proper server, but you can try nvidia-smi to see if it reports the GPU temperature. If it does, you’ll see the temperature gradually rising to 90C+, at which point the GPU will “drop off the bus” and no longer be usable until you power off and let it cool down. As a further proof point, you could open your system and direct a powerful fan to blow air across the M2090 heatsink, which will probably cause it to “last” longer, but it may still overheat.
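
For anyone who wants to try the nvidia-smi check described above, here is a minimal sketch that polls the temperature once per second. It assumes nvidia-smi is on the PATH and that the driver reports `temperature.gpu` for the card at all; on an M2090 outside a server it may well print a non-numeric value instead, which the parser records as `None`. The function names and the 90 C limit (taken from the figure quoted above) are illustrative, not an official tool.

```python
import subprocess
import time


def parse_temperatures(output):
    """Turn nvidia-smi CSV output (one line per GPU) into a list.

    Each entry is an int in degrees C, or None when the driver did not
    report a number (for example "[Not Supported]" or "N/A").
    """
    temps = []
    for line in output.strip().splitlines():
        try:
            temps.append(int(line.strip()))
        except ValueError:
            temps.append(None)
    return temps


def read_temperatures():
    """Query every GPU's temperature through nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def watch(read=read_temperatures, limit_c=90, interval_s=1.0, samples=60):
    """Poll temperatures, stopping early once any GPU reaches limit_c.

    limit_c=90 is an assumption based on the "90C+" figure in this thread.
    Returns the list of readings that were taken.
    """
    history = []
    for _ in range(samples):
        temps = read()
        history.append(temps)
        print(temps)
        if any(t is not None and t >= limit_c for t in temps):
            print("At or above the limit; the GPU may drop off the bus.")
            break
        time.sleep(interval_s)
    return history


if __name__ == "__main__":
    watch()
```

If the card disappears while the loop is running, the nvidia-smi call itself is likely to fail, so an exception from `read_temperatures` is also a useful signal.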

njuffa · June 10, 2016, 2:46am

As I recall, M2090s were hot running parts, so I am doubtful a home-grown forced-air contraption would be able to keep this GPU sufficiently cool under full load. I have seen reports by “modders” who got M2090s to work with water cooling, so you may want to research that. As txbob explained, operating an M2090 outside of a server enclosure designed for such GPUs is an unsupported configuration and entirely at the risk of the user.

Envidoc · June 10, 2016, 9:26am

Thanks for these responses.

I see that this may have caused the problem.

Just out of curiosity, how fast can the temperature of a GPU core change?
I mean, when I look at GPU-Z (which shows a temperature of approx. 50°C when the GPU is not being used by a program [but I’m not sure this value is right]), the temperature changes rather slowly as soon as utilization increases.

Robert_Crovella · June 10, 2016, 6:59pm

It can change fast (1 degree C increase per second or faster) depending on what the GPU is doing. Or it may change more slowly if the workload is lower or the GPU is idle.
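
To put that rate in perspective, here is a rough back-of-the-envelope sketch. It assumes a constant heating rate, which real hardware will not follow exactly, and uses the numbers mentioned in this thread: about 50 C at idle, a limit of about 90 C, and 1 degree C per second. The function name is made up for illustration.

```python
def seconds_to_limit(start_c, limit_c, rate_c_per_s):
    """Estimate seconds until limit_c is reached at a constant heating rate.

    This is a linear approximation only; actual heating slows or speeds up
    with load and airflow.
    """
    if rate_c_per_s <= 0:
        raise ValueError("rate_c_per_s must be positive")
    return max(0.0, (limit_c - start_c) / rate_c_per_s)


# 50 C idle, 90 C limit, 1 C per second: 40 degrees of headroom
print(seconds_to_limit(50, 90, 1.0))  # 40.0
```

So under sustained load with no forced airflow, the headroom could be used up in well under a minute.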

scottofazphx · August 22, 2016, 8:13pm

I use an easy solution for M2090 cooling and temperature monitoring. I have 12 of these, some inside a server, but 2 are in a plex 375, which you can pick up on eBay for less than 100 bucks. You can fit 2 M2090s in one. The fan of the plex can be controlled by the computer, and the temperature can be viewed with HWiNFO64.
