I run/manage a rendering farm with a bunch of RTX 30XX cards. They are slightly overclocked and run 24/7. We try to keep them as cool as possible, but from time to time we get crashes because some cards start to run hotter.
We had to run a 12-hour benchmark under Windows with HWiNFO64 to understand what was going on, and we saw the memory junction temperature reach around 100C (max: 108C) before the card crashed.
We understand that this is related to GDDR6X memory and how the chips work.
However, we are unable to monitor the memory junction temperature because we run our cluster under Linux.
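For context, here is a minimal sketch (assuming the pynvml bindings, `pip install nvidia-ml-py`) of what we can read today through NVML. As far as we can tell, the only public sensor type is NVML_TEMPERATURE_GPU (the core sensor), so there is no equivalent of the memory junction reading HWiNFO64 shows on Windows:

```python
# Minimal sketch, assuming the pynvml bindings are installed.
# NVML's public temperature API only takes NVML_TEMPERATURE_GPU;
# there is no sensor type for the GDDR6X memory junction.
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTemperature,
    NVML_TEMPERATURE_GPU,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        # Core GPU temperature in degrees Celsius -- the only sensor exposed.
        temp = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: {temp} C")
finally:
    nvmlShutdown()
```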
Our goal is to keep the investment in these cards, but without proper monitoring, the risk of losing cards to high temperatures makes it hard to justify not switching to AMD or older NVIDIA cards.
Note: we use Prometheus exporters/Grafana to monitor the host and each card. Unfortunately, due to the lack of support in Linux, the exporter is likewise unable to export the memory junction temperature.
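To make the gap concrete, here is a stripped-down exporter sketch along the lines of what we run (assuming prometheus_client and pynvml; the metric name and port are illustrative, not our production config). It can publish the core GPU temperature, but there is simply no NVML call with which to fill a memory-junction gauge next to it:

```python
# Stripped-down exporter sketch, assuming prometheus_client and pynvml.
# Metric name and port are illustrative, not our production setup.
import time
from prometheus_client import Gauge, start_http_server
from pynvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTemperature,
    NVML_TEMPERATURE_GPU,
)

gpu_temp = Gauge("gpu_core_temp_celsius", "GPU core temperature", ["gpu"])
# A gpu_mem_junction_temp_celsius gauge cannot be populated: NVML exposes
# no memory junction sensor on Linux, which is exactly the gap above.

nvmlInit()
start_http_server(9400)  # illustrative port
while True:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        gpu_temp.labels(gpu=str(i)).set(
            nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
        )
    time.sleep(10)
```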
That's great! Are you able to share any more information (when did you start addressing this / when was the ticket raised, any progress on the fix, any ETA for the fix)?
Please prioritise this. I've been stuck for a month, unable to use my expensive card because I can't check VRAM temperatures while training models and don't want to burn out my card. This should be of the highest priority, since we have no idea what our hardware is being exposed to during full loads.