Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

@wpierce @nadeemm

Why is it so difficult to understand the issue? It is so easy to replicate on Linux:

nvidia-smi dmon

More than 6 months and still no real solution here. It is really annoying, especially for those of us who have invested thousands of dollars in your products.

Why not switch to AMD in the near future?

Thanks…


I’m glad to have found this thread, and I hope that Nvidia hears its customers.

I agree with everyone else that Nvidia should enable VRAM temp monitoring!!

Absolutely! @nvidia please make this happen!

Hi Nvidia, can you FFS show the memtemp in Linux? Thx

I have joined specifically to comment here.
This has gotten ridiculous. @nvidia, can you please implement a memory junction temperature readout in nvidia-smi (or nvrm, I don’t care)?
The inability to monitor memory temps is going to end up costing a massive amount in warranty replacements.
And booting into Windows isn’t a workable solution (at least not for me), since duplicating my workload in Windows isn’t feasible.

It sounds like Nvidia’s official position on monitoring memory temps in Linux is WONTFIX, but I think everyone on this thread would like to understand why. This data point already exists on Windows, so obviously the hardware provides it. Is there a technical reason for omitting it on Linux, or is Nvidia just treating Linux as a second-class citizen?

Hi,
For a given GPU, nvidia-smi output is the same between Windows and Linux. Here is the output on an RTX 2080:

Windows

C:\>nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    12    42     -
    0    11    42     -
    0     6    41     -

Linux

wpierce: ~$ nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0     5    42     -
    0     5    42     -
    0     6    42     -

The mtemp value in the nvidia-smi output is not the memory junction temperature. It is the highest temperature recorded across all HBM temperature sensors, and it is only supported on SKUs with HBM. On other GPUs, such as the RTX 2080 above, the column shows "-".
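For anyone scripting around this, here is a minimal Python sketch of how the dmon output shown above could be parsed so that an unsupported mtemp reading is not mistaken for a real value. The names `parse_dmon` and `SAMPLE` are made up for illustration, and the sketch assumes every field is either an integer or "-" as in the RTX 2080 output; other driver versions may format the columns differently.

```python
from typing import Dict, List, Optional

# Sample text copied from the `nvidia-smi dmon -s p` output in this post.
SAMPLE = """\
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    12    42     -
    0    11    42     -
    0     6    41     -
"""


def parse_dmon(text: str) -> List[Dict[str, Optional[int]]]:
    """Parse dmon output into one dict per sample row (hypothetical helper).

    A "-" field is mapped to None, meaning the GPU does not report that metric.
    """
    columns: List[str] = []
    rows: List[Dict[str, Optional[int]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first comment line names the columns; later ones (units,
            # repeated headers) are skipped.
            if not columns:
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # not a data row we understand
        rows.append(
            {name: None if value == "-" else int(value)
             for name, value in zip(columns, values)}
        )
    return rows


samples = parse_dmon(SAMPLE)
for sample in samples:
    mtemp = sample["mtemp"]
    status = "not supported" if mtemp is None else f"{mtemp} C"
    print(f"GPU {sample['gpu']}: core {sample['gtemp']} C, mtemp {status}")
```

In a real script the text would come from running nvidia-smi itself rather than from a pasted sample. The parsing step stays the same.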

@nadeemm will explain more about what memory temperatures are exposed and why.


Thank you for the info. From my limited experience with Windows (full-time Linux user here), Windows tools are able to monitor these values.

Here’s an example with HWiNFO showing “GPU Memory Junction Temperature”:

Here’s an example with GPU-Z including both “Hot Spot” and “Memory Temperature” (at 2:28):

I understand that these data points are not available to us on Linux, but I don’t understand why. If the hardware didn’t support it, then I’m really confused about what the Windows GPU monitoring tools are reporting. Why isn’t it possible to get these same data points on Linux? I hope @nadeemm will be able to speak to this.

Thanks


Excuse me? This simple function has been requested for almost half a year, and you guys would rather clarify things to desperate users and do nothing than provide the thing we really need.
Is it like “we know what you need but we just don’t give a $hit because it’s not cooooooool”? Give us a short answer.
