Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

I’m glad to have found this thread, and I hope Nvidia listens to its customers.

I agree with everyone else that Nvidia should enable VRAM temp monitoring!!

Absolutely! @2024a please make this happen!

Hi, nvidia can you FFS show the memtemp in linux? thx

1 Like

I have joined specifically to comment here.
This has gotten ridiculous, @2024a can you please implement memory junction temperature readout in nvidia-smi (or nvrm, I don’t care)?
The inability to monitor memory temps is going to end up costing a massive amount in warranty replacements.
And booting into windows isn’t a workable solution (at least not for me) since duplicating my workload in windows isn’t feasible.

1 Like

It sounds like Nvidia’s official position on monitoring memory temps in Linux is WONTFIX. But I think everyone on this thread would like to understand why. This data point already exists on Windows, so obviously the hardware provides it. Is there a technical reason for omitting it on Linux, or is Nvidia just treating Linux as a second-class citizen?

1 Like

Hi,
For a given GPU, nvidia-smi output is the same between Windows and Linux. Here is the output on an RTX 2080:

Windows

C:\>nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    12    42     -
    0    11    42     -
    0     6    41     -

Linux

wpierce: ~$ nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0     5    42     -
    0     5    42     -
    0     6    42     -

mtemp output from nvidia-smi is not memory junction temperature. The reported temperature is the hottest recorded across all HBM temperature sensors. It is only supported on SKUs with HBM memory.
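For anyone who wants to log these samples in the meantime, here is a minimal Python sketch that parses the `nvidia-smi dmon -s p` text shown above. The function name `parse_dmon` and the `SAMPLE` string are made up for illustration. It assumes the column layout from this post (gpu, pwr, gtemp, mtemp) with integer values, and it maps the `-` placeholder to `None`, so on non-HBM cards `mtemp` simply comes back empty.

```python
from typing import Optional

# Sample text copied from the Windows output in this post.
SAMPLE = """\
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    12    42     -
    0    11    42     -
    0     6    41     -
"""


def parse_dmon(text: str) -> list[dict[str, Optional[int]]]:
    """Parse `nvidia-smi dmon -s p` text into one dict per sample row."""
    columns: list[str] = []
    rows: list[dict[str, Optional[int]]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header line names the columns; the second only gives units.
            if not columns:
                columns = line.lstrip("#").split()
            continue
        # "-" means the value is not reported for this GPU.
        values = [None if v == "-" else int(v) for v in line.split()]
        rows.append(dict(zip(columns, values)))
    return rows


if __name__ == "__main__":
    for row in parse_dmon(SAMPLE):
        print(row)
```

To use it on a live system you could feed it the captured stdout of `nvidia-smi dmon -s p -c 3` (the `-c` option limits the number of samples). Other driver versions may print extra columns or non-integer values, in which case the `int()` conversion would need adjusting.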

@nadeemm will explain more about what memory temperatures are exposed and why.

3 Likes

Thank you for the info. From my limited experience with Windows (full-time Linux user here), Windows tools are able to monitor these values.

Here’s an example with hwinfo showing “GPU Memory Junction Temperature”:

Here’s an example with GPU-z including both “Hot Spot” and “Memory Temperature” at (2:28s):

I understand that these data points are not available to us on linux, but I don’t understand why. If the hardware didn’t support it, then I’m really confused about what the windows GPU monitoring tools are reporting. Why isn’t it possible to get these same data points on linux? I hope @nadeemm will be able to speak to this.

Thanks

2 Likes

Excuse me? This simple function has been requested for almost half a year, and you would rather explain things to desperate users and do nothing than deliver the one thing we really need.
Is it like “we know what you need but we just don’t give a $hit because it’s not cooooooool”? Give us a short answer.

4 Likes

Plz plz plz add the function to the Linux driver!

5 Likes

You don’t want us to know that your GDDR6X cooling is totally shit, but everyone already knows that. So it’s fine to let us see the real VRAM temp. We just want to know!

4 Likes

They will not provide this info, because it will turn out that even without mining, GDDR6X just runs hot as hell.

2 Likes

There are no trade secrets that we are interested in, Nvidia. We are trying to help you reduce reputational damage and warranty claims.

Everyone knows the max spec is 110C; we are just looking for an accurate way to monitor temperature so we can take proactive action on Linux-based systems. The plain GPU temperature reading doesn’t tell you what the VRAM temperature is doing.
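To make “proactive action” concrete, here is a rough Python sketch of the check we would run if a memory temperature reading were available. Everything in it is hypothetical: `choose_action`, the action names, and the 10C margin are made up for illustration, and the 110C limit is just the figure quoted in this thread, not something taken from a datasheet. A reading of `None` stands for “the driver does not report it”, which is the current situation on Linux.

```python
from typing import Optional

# Limit quoted in this thread; treat it as an assumption, not a verified spec.
MAX_SPEC_C = 110


def choose_action(mem_temp_c: Optional[int], margin_c: int = 10) -> str:
    """Map a memory temperature reading to a coarse action."""
    if mem_temp_c is None:
        # No reading available: we cannot decide anything.
        return "unknown"
    if mem_temp_c >= MAX_SPEC_C:
        # At or above the quoted limit: stop the workload.
        return "stop"
    if mem_temp_c >= MAX_SPEC_C - margin_c:
        # Within the safety margin: reduce load or raise fan speed.
        return "throttle"
    return "ok"


if __name__ == "__main__":
    for reading in (None, 85, 104, 112):
        print(reading, "->", choose_action(reading))
```

With today’s driver every reading would fall into the “unknown” branch, which is exactly the problem.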

1 Like

Stop posting BS and making me feel like a real customer who bought a $1000 piece of crap from you.

1 Like

@nadeemm will explain more about what memory temperatures are exposed and why.

When can we expect an update from @nadeemm?

To the extent that nvidia is mulling over whether it’s worth adding the feature or not, I’d like to vote my opinion that most definitely yes it is. As a system builder, I use nvidia RTX cards and I’m trying to optimize the airflow in order to keep the cards cool. It’s impractical to do A/B testing with case airflow configurations when I can’t get a reading of the memory temps. Those running windows do get the temps, but it’s no less important on linux. Given that the memory temps are known to run hot, it’s kind of important that we have this information when tweaking the case fans rather than doing so blind and just hoping for the best.

4 Likes

Any update? Really could do with that information. All Nvidia has said is that Windows can’t do it either. But it can?

2 Likes

Agreed! Agreed! Agreed! Need it! Need it! Need it!

So sad to see that despite all these comments, nothing has been done by NVIDIA since February. :((

1 Like

+1. I am a machine learning student and I spent $1000 on my RTX 3080. When I train my deep learning models, the RTX 3080 gets extremely hot and my PC often shuts down. From what I found on the internet, the problem is caused by the memory temperature of the GDDR6X. I hope Nvidia updates the driver so that we can monitor the memory junction temperature.

I guess it is simple: Nvidia is just hiding something from us, and this “something” could really harm their reputation and is somehow connected to this damn temp data. And they will say or do anything to keep this “something” hidden. So I guess it is pointless to ask them to reveal that temp data. Their aCCes are dearer to them than our aCCes, that’s it…

Try that command with a GPU that has GDDR6X.
We already know that you cannot monitor this temperature for GDDR6 in Windows or Linux, but throttling does not seem to be an issue with this memory type.
The issue is with GDDR6X, whose temperature can be monitored in Windows but NOT in Linux.

4 Likes