Tesla K80 stopped working

I’ve been using the Tesla K80 for several months through Matlab. It worked fine until a few days ago. Now, every time I use gpuDevice to select the K80, Matlab freezes completely and I have to force-close it. My computer has both a Tesla K80 for computation and a GTX 970 for display (the two were working fine together).

There is nothing wrong with the Tesla K80. When I remove the GTX 970, the K80 works fine. When I put back the 970, the K80 can’t be selected, even though it shows up in the device list in the device manager and the nVidia control panel.

I have updated the drivers for both cards to the latest as of today.

Can someone help please?

Setup
Windows 10 64-bit latest update (28/9/2016)
Tesla K80 Release date 2016.8.18 version 354.99 CUDA 7.5
GTX 970 Release date 2016.9.21 version 372.90
Matlab 2016b

Many thanks

So what happened on the day your setup stopped working, after being functional for a long time? My working hypothesis is that there was a change made to either hardware or software on that day. In which case you would want to find out what it was and undo that change.

I am debating with myself whether a defective GTX 970 could lead to these symptoms. Have you tried the GTX 970 in a different machine to see whether it is still functional? I am also wondering what kind of enclosure is suitable for both a passively cooled K80 and an actively cooled GTX 970. Any detail on that?

It’s not possible to have 2 different drivers installed (354.99 and 372.90).

I would suggest you start by selecting the driver that comes with CUDA 7.5 and use that.
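One way to check which driver is actually loaded is to ask nvidia-smi, the command-line utility that ships with the NVIDIA driver. The following is only a minimal Python sketch: the helper names parse_gpu_rows and driver_versions are made up for illustration, and it assumes nvidia-smi is on the PATH, which may not be the case on every Windows install.

```python
import csv
import io
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, name, driver version.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_rows(text):
    """Parse 'index, name, driver_version' CSV lines into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, name, driver = (f.strip() for f in fields)
        rows.append({"index": int(index), "name": name, "driver": driver})
    return rows


def driver_versions(rows):
    """Return the set of distinct driver versions reported by the GPUs."""
    return {row["driver"] for row in rows}


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode != 0:
            print("nvidia-smi failed:", result.stderr.strip())
        else:
            rows = parse_gpu_rows(result.stdout)
            for row in rows:
                print(row["index"], row["name"], row["driver"])
            print("driver versions seen:", sorted(driver_versions(rows)))
```

Since only one NVIDIA driver is loaded at a time, every board listed should report the same version. A board that is missing from the list would suggest that the loaded driver is not handling it.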

What sort of machine is the K80 installed in? Is it an OEM server that is certified for use with K80?

“So what happened on the day your setup stopped working, after being functional for a long time?”

My guess, but this is only a guess, is that this happened after Windows 10 updated. There were no hardware upgrades of any kind. I’m guessing because I had stopped running computations for a few days. Anything could have happened in that time.

“I am debating with myself whether a defective GTX 970 could lead to these symptoms”

I don’t think that it’s the GTX 970. In addition to working as a display card, Matlab utilizes it as a GPU without any problems (This is something that Matlab couldn’t do before. When I updated to the latest nVidia drivers, it worked so well it nearly reached the performance of the K80, in terms of computational speed).

“Have you tried the GTX 970 in a different machine to see whether it is still functional?”

I tried it on the same machine without the K80. It works fine.

“I am also wondering what kind of enclosure is suitable for both a passively cooled K80 and an actively cooled GTX 970. Any detail on that?”

I made my own rig. I built a fan/duct for the card. This is a video of it:

https://www.youtube.com/watch?v=UcyP3ASRgBc

“It’s not possible to have 2 different drivers installed (354.99 and 372.90). I would suggest you start by selecting the driver that comes with CUDA 7.5 and use that.”

I had both before (at least that’s what I thought). I’m going to uninstall all drivers and reinstall the K80 driver alone to see if it works.

“What sort of machine is the K80 installed in? Is it an OEM server that is certified for use with K80?”

It’s not certified. I have made a video of it, which I posted in my previous post.

Operating a K80 outside a certified server solution means all bets are off (one might also say, “you are asking for trouble” :-)

My first guess with such jury-rigged solutions is usually that cooling is insufficient. Based on the video, you are clearly aware of that issue. Whether your home-grown solution is sufficiently robust I cannot judge, I have never tried this myself. Best I know, server solutions normally push air over the passive heatsink, rather than drawing it over the heat sink as in your setup. The difference in the direction of the airflow could make a difference in terms of turbulence inside the shroud, and it could also cause dust accumulation between the fins of the heat sink.

As I mentioned, the card worked very well for several months. In fact, it still works just as well when I remove the 970 (something I forgot to mention in the first post). I don’t think that the issue is how I’m cooling the card. I believe the fan/duct is sufficient to draw the air out and cool the card. Plus, the rig is placed in a cold room.

You would definitely want to follow up on txbob’s driver cleanup suggestion, and your driver selection should probably be driven by what is needed to operate the K80, which may be more restrictive in terms of suitable drivers (I think the Tesla drivers are updated less frequently than the consumer card drivers, but I could be wrong).

So I’ve gone through cleanly uninstalling all nVidia drivers and reinstalling the K80 drivers alone.

I’m still faced with the same problem of having Matlab hang when I switch to the K80s.

Does anyone care to share their experiences, further thoughts on this?

*Cough*

Purchase a K80 from an OEM in an OEM qualified server.

Your configuration and general approach are entirely unsupported.

My general approach might be unsupported officially, but I’m counting on help from others who might have had the same experience.

Bear in mind that I built this rig for my personal experiments. I’m planning on establishing a lab with a bunch of nVidia Tesla K80 cards. Let’s just hope that these sorts of issues are not ignored by nVidia support. That would be a serious turn-off.

I think you may be overestimating how many people run systems similar to yours (let alone identical ones) AND are active users of this forum. Any home-brew system has the drawback that typically nobody else runs a system that is exactly identical, which makes reproducibility iffy.

As your system used to work fine before it stopped working all of a sudden, I think your best bet is to methodically track down any and all changes that were made to the system (OS logs may help with that). Sometimes such issues are caused by changes people thought to be inconsequential, except they aren’t. Been there, done that, got the t-shirt.
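To make that tracking methodical, it can help to write every known change down with its date and then look only at the window between the last known-good run and the first failure. A minimal Python sketch of that bookkeeping follows; ChangeEvent and changes_between are hypothetical names, and the entries in the example are placeholders rather than anything taken from a real log.

```python
from datetime import date
from typing import NamedTuple


class ChangeEvent(NamedTuple):
    when: date  # day the change was made
    what: str   # short description of the change


def changes_between(events, last_good, first_bad):
    """Return events dated from last_good to first_bad (inclusive), oldest first."""
    suspects = [e for e in events if last_good <= e.when <= first_bad]
    return sorted(suspects, key=lambda e: e.when)


if __name__ == "__main__":
    # Placeholder entries only; replace them with what the OS logs show.
    log = [
        ChangeEvent(date(2000, 1, 9), "placeholder: OS update installed"),
        ChangeEvent(date(2000, 1, 2), "placeholder: application upgraded"),
        ChangeEvent(date(2000, 1, 7), "placeholder: display driver updated"),
    ]
    for event in changes_between(log, date(2000, 1, 5), date(2000, 1, 10)):
        print(event.when.isoformat(), event.what)
```

Each change that falls inside the window is a candidate to undo, one at a time, retesting after every step.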

When a hardware vendor (any vendor, not just NVIDIA) tells their customers that some configuration or use of their product is unsupported, they mean exactly that in my experience.

So many responses and almost no attempt at an explanation. Pity.
What is interesting is why an NVIDIA user or support person would need an older driver version for an older card, when NVIDIA says that even the current 440 driver supports the K80. And even if that approach works, it’s a bad attitude for a project.
The “cleaning” approach is also a pity. If install/uninstall does not work correctly and leaves trash behind, so that another driver later fails to work, then that is something to fix, not a reason to create cleanup tools. It is simply a fault.

If Matlab freezes, that is Matlab’s fault. Try another computational application to see whether the K80 works fine there.
NVIDIA’s told you that you cannot have two driver packages installed. Well, you did, so it is not prohibited. If it is prohibited, how come it is allowed? These are their installers :-/

I didn’t fully understand how you managed to remove the 970. The K80 doesn’t have a video output, which means there must be another video output for Windows to work. Windows, except for the command-line-only Windows Server version, won’t boot without a video adapter.

I suggest trying Linux and working in Matlab over VNC or a remote X display, for example. And if you have some other video output for Windows, stop using the 970, as it is not needed. Keep your environment simple and unified :) If you need more power than that, use 2x K80.

P.S. I came here to find out which motherboards have been tested with the K80, or which driver versions people have had problems with.

And the lesson learned is that product support doesn’t give a dime (pun intended) if there is an issue with an update, an uninstall, or an incompatibility … as long as it is a special, uncommon case :) that only happened to you. That is why I don’t offer Microsoft or any other support a helping hand, and don’t help them either! They should learn that a community is not the way to help people; a testing team or exact installation procedures and protocols are. As for diagnostic tools: Linux has had the standard glxgears application for about 20 years. It is simple and correct, and it can show you how your GL is performing without too much hassle. Just start it, enlarge the window a little, and compare what you see with what you expect to see.

I found this thread when I was considering installing a Tesla K80 into a PC, so I’d like to share my own experience here in the hope that it helps others and possibly saves them some time and money.

I have a PC with a GeForce RTX 2070 Super GPU. It runs Windows 10 Home.

I got the Tesla K80 and also bought a small 80 mm server fan with a custom 3D-printed adapter to fit it onto the K80. Cooling was never an issue, as the K80 stayed below 75 °C even during intense training runs.
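For anyone who wants to keep an eye on temperatures with a home-made cooler, nvidia-smi can report them per GPU. Below is a minimal Python sketch, assuming nvidia-smi is on the PATH; parse_temperatures and too_hot are made-up helper names, and the 75 degree default is simply the figure mentioned in this post, not an NVIDIA specification.

```python
import shutil
import subprocess

# One line per GPU: "index, temperature" with the unit suffix stripped.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(text):
    """Map GPU index -> temperature in degrees C from 'index, temp' lines."""
    temps = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip blank lines and non-numeric readings
        temps[int(parts[0])] = int(parts[1])
    return temps


def too_hot(temps, limit_c=75):
    """Return the indices of GPUs reading above limit_c, in ascending order."""
    return sorted(i for i, t in temps.items() if t > limit_c)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode != 0:
            print("nvidia-smi failed:", result.stderr.strip())
        else:
            temps = parse_temperatures(result.stdout)
            print("temperatures:", temps)
            print("above limit:", too_hot(temps))
```

This takes a single reading; running it repeatedly during a long training job would show whether the cooler keeps up under sustained load.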

Installing the K80 wasn’t a big deal but I did need to drop $120 on a new power supply to make sure it had enough power.

Then I installed the K80 drivers, which seemed to want to overwrite some of my 2070 Super GPU drivers. So I had to go to custom install and make sure that clean install was not selected. Then I had to separately reinstall my 2070 Super drivers. After some back and forth and changing some BIOS settings, mission accomplished: I had a PC with both the 2070 Super installed and the K80 installed. And both were “working” in the sense that they could perform GPU related tasks.

Here is where the issues began:

  1. The PC would crash if I had my Focusrite audio interface plugged in. So I had to unplug that.
  2. On initial boot after the PC hadn’t been on for a while, System Interrupts would take up CPU processing and max out two of my cores, rendering my PC slower than a session in the Senate. After a reboot, this problem would go away. This meant that every time I wanted to use my PC for anything, I would need to start it, load Windows, then reboot.
  3. Running Beat Saber in VR on the 2070 Super had some issues. Everything would get choppy whenever I turned my head. This was while the K80 was not doing any sort of training or anything. In other words, this setup was handicapping my 2070 Super GPU.

After removing the K80, resetting BIOS and reinstalling the 2070 drivers, the problems vanished. I’ve ordered a qualified older secondhand server that I’ll be using solely for machine learning tasks.

So I’m sure that Nvidia never intended the K80 to be installed with Windows Home and another graphics card. If you do decide to set this up in a PC, just be aware that it may have some major issues with your current hardware and not play nice. When they develop a GPU, much of the work is in debugging to make sure it works properly in a particular environment and plays nice with other devices. That has not been done with the K80 outside of qualified servers and so just be aware of what you are getting yourself into before you start.
