Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

+1

2 Likes

+1 I need this feature. Either add it or fix your thermal pads. I’m tired of your nonsense, Nvidia.

4 Likes

+1 Temp is a MUST!

4 Likes

I’m currently fine-tuning a machine learning model, and this feature would be very useful. However, constantly monitoring the GPU’s temps is not the correct solution; Nvidia should replace the FE cards it has sold with ones that have adequate cooling.

5 Likes

Let’s be realistic; that’s just not going to happen, and if it did, it would take so long that the next series of GPUs would be out by then, given the chip shortages.

However, having the ability to monitor the temps is a very quick and easy change that will go a long way for all of us.

3 Likes

You probably missed this.

Also, spamming +1’s after this linked message doesn’t make the process any faster.

2 Likes

I am unable to edit my OP, but @wpierce, feel free to edit the OP with the current status. It would be unfair to mark your post as a solution.

Spamming +1’s may actually make the process faster implicitly. More accurately, what it does is serve as a good proxy of the number of customers significantly impacted by this problem (after all, they went through the effort of posting). The higher that number is, the higher Nvidia should prioritise this problem and ought to allocate resources accordingly.

6 Likes

Please add this feature on Linux, thank you

5 Likes

Necessary feature for many users and admins. Please add this feature for Linux as well. Thanks a lot

3 Likes

+1
To cool the card adequately, the temperature info is essential.

3 Likes

+1
This is an absolutely essential feature with the high VRAM temps of the 30 series cards. I’m able to test cards on a Windows bench to determine undervolting settings for safe operating temperatures before setting them up in Linux, but not every work environment allows for this.

3 Likes

What settings have you found to keep the temp in check?

2 Likes

+1 for this feature.

WTF. We have issues when training DL models in Linux. The fact that Nvidia hasn’t implemented this simple feature request is pure insanity.

3 Likes

Glad to see that this feature is in progress, adding +1 for appreciation and also for urgency.

Anybody running heavy loads on their cards under Linux is currently stuck relying only on GPU core temps, which is not a good situation.

3 Likes

+1, we really need this for Linux as well.
BTW, the new 30 series vBIOS updater does not run on Linux either…

2 Likes

As a data science company, we’re currently moving from 2080 Ti cards to the 3000 series, but this is a serious concern for us.
We bought some RTX 3080 cards for benchmarking. We experienced performance throttling because of the high memory temperatures (GPU core temp at 45°C, fan speed at 100%), and on top of that, we can’t monitor the memory temperature.

We’re looking forward to further improvements on these two points!

3 Likes

I don’t know how ignorant a solution this is, but we’ve started using fan speed as an analog for VRAM temp. The idea is that the BIOS on the card can read the temp and control the fan accordingly, so if the fan isn’t at 100%, the temp must not be at max.

It’s predicated on a big assumption, and possibly a poor one.

We have some software that runs a PID loop, adjusting the power limit of the cards to keep the fan speed near 85%.
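
A loop of this kind could be sketched roughly as follows. This is not the actual software described above, just an illustration of the idea. The gains, the 200 to 320 W range, the 5 second period, and the names `FanPid`, `read_fan_pct`, `set_power_limit` and `control_loop` are all assumptions made for the example. It only relies on `nvidia-smi --query-gpu=fan.speed` and `nvidia-smi -pl`, and changing the power limit normally needs root.

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class FanPid:
    """PID controller: fan speed (%) in, power limit (W) out. All values are illustrative."""
    target_fan_pct: float = 85.0   # setpoint taken from the post above
    kp: float = 2.0                # assumed gains, would need tuning per card
    ki: float = 0.1
    kd: float = 0.0
    min_watts: float = 200.0       # assumed allowed power-limit range
    max_watts: float = 320.0
    i_term: float = 0.0
    prev_error: float = 0.0

    def step(self, fan_pct: float, dt: float) -> float:
        # Negative error means the fan is spinning faster than the target,
        # so the power limit should come down.
        error = self.target_fan_pct - fan_pct
        span = self.max_watts - self.min_watts
        # Clamp the integral term so it cannot wind up beyond the usable range.
        self.i_term = min(0.0, max(-span, self.i_term + self.ki * error * dt))
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.max_watts + self.kp * error + self.i_term + self.kd * derivative
        return min(self.max_watts, max(self.min_watts, raw))


def read_fan_pct(gpu: int = 0) -> float:
    # Returns e.g. 85.0; raises ValueError if the card reports "[N/A]".
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=fan.speed", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip())


def set_power_limit(watts: float, gpu: int = 0) -> None:
    # Usually requires root privileges.
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", str(int(watts))], check=True)


def control_loop(gpu: int = 0, period_s: float = 5.0) -> None:
    pid = FanPid()
    while True:
        set_power_limit(pid.step(read_fan_pct(gpu), period_s), gpu)
        time.sleep(period_s)
```

Calling `control_loop()` as root would start regulating GPU 0. The controller starts from the maximum power limit and only backs off while the fan runs above the target, which matches the "fan not at 100% means temps are probably fine" assumption and inherits all of its weaknesses.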

FWIW. I’ll report back when our cards start melting down :p

Is there any progress on this case?

2 Likes

Knock knock knock.

Your question has been received. You should expect a response from us within 24 hours.

Subject

3269484 Internal Bug status

Question Reference # 210518-000156

Date Created: 05/18/2021 08:37 AM

Status: Researching

3 Likes

Thanks for chasing them up @testkayit. Unfortunately it’s very much a case of Nvidia not caring enough at this point.

The next step is to try and get them to actually put a deadline on this ticket so that it rises out of the almost infinite pool of lowest-priority tickets.

@wpierce can you be our internal champion please 🙏

3 Likes