Absolutely! @2024a please make this happen!
Hi, nvidia can you FFS show the memtemp in linux? thx
I have joined specifically to comment here.
This has gotten ridiculous, @2024a can you please implement memory junction temperature readout in nvidia-smi (or nvrm, I don't care)?
The inability to monitor memory temps is going to end up costing a massive amount in warranty replacements.
And booting into Windows isn't a workable solution (at least not for me), since duplicating my workload in Windows isn't feasible.
It sounds like Nvidia's official position on monitoring memory temps in Linux is WONTFIX. But I think everyone on this thread would like to understand why. This data point already exists on Windows, so obviously the hardware provides it. Is there a technical reason for omitting it on Linux, or is Nvidia just treating Linux as a second-class citizen?
Hi,
For a given GPU, nvidia-smi output is the same between Windows and Linux. Here is the output on an RTX 2080:
Windows
C:\>nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0    12    42     -
    0    11    42     -
    0     6    41     -
Linux
wpierce: ~$ nvidia-smi dmon -s p
# gpu   pwr gtemp mtemp
# Idx     W     C     C
    0     5    42     -
    0     5    42     -
    0     6    42     -
mtemp output from nvidia-smi is not memory junction temperature. The reported temperature is the hottest recorded across all HBM temperature sensors. It is only supported on SKUs with HBM memory.
@nadeemm will explain more about what memory temperatures are exposed and why.
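For anyone who wants to check from a script whether their card reports mtemp at all, here is a minimal sketch that parses the `nvidia-smi dmon -s p` text shown above. It assumes the column layout from that output (a `#` header line with column names, then a `#` units line, then one row per sample) and treats `-` as "not reported". The function name `parse_dmon` and the `SAMPLE` constant are made up for illustration and are not part of any NVIDIA tool.

```python
from typing import Dict, List, Optional

# Sample text copied from the RTX 2080 output earlier in this thread.
SAMPLE = """\
# gpu pwr gtemp mtemp
# Idx W C C
0 12 42 -
0 11 42 -
0 6 41 -
"""


def parse_dmon(text: str) -> List[Dict[str, Optional[int]]]:
    """Parse `nvidia-smi dmon -s p` text into one dict per sample row.

    Assumption: the first '#' line names the columns and any later '#'
    line (units, or a repeated header) can be skipped.
    """
    columns: List[str] = []
    rows: List[Dict[str, Optional[int]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if not columns:
                columns = line.lstrip("#").split()
            continue
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip anything that does not match the header
        # '-' means the driver did not report a value for that column
        rows.append(
            {name: (None if value == "-" else int(value))
             for name, value in zip(columns, fields)}
        )
    return rows


if __name__ == "__main__":
    for row in parse_dmon(SAMPLE):
        status = "not reported" if row["mtemp"] is None else f'{row["mtemp"]} C'
        print(f'GPU {row["gpu"]}: gtemp={row["gtemp"]} C, mtemp={status}')
```

On the sample above every row comes back with `mtemp` set to `None`, which matches the `-` in the output. To use it on a real machine you would feed it the captured text of the same command instead of `SAMPLE`.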
Thank you for the info. From my limited experience with Windows (full-time Linux user here), Windows tools are able to monitor these values.
Here's an example with hwinfo showing "GPU Memory Junction Temperature":
Here's an example with GPU-z including both "Hot Spot" and "Memory Temperature" at (2:28s):
I understand that these data points are not available to us on Linux, but I don't understand why. If the hardware didn't support it, then I'm really confused about what the Windows GPU monitoring tools are reporting. Why isn't it possible to get these same data points on Linux? I hope @nadeemm will be able to speak to this.
Thanks
Excuse me? This simple function has been requested for almost half a year, and you guys would rather "clarify" things to desperate users and do nothing than deliver the thing we really need.
Is it like "we know what you need but we just don't give a $hit because it's not coooooooool"? Give us a short answer.
Plz plz plz add the function to the Linux driver!
You don't want us to know that your GDDR6X cooling is totally shit, but everyone already knows that. So it's fine to let us know the real VRAM temp, we just want to know!
They will not provide this info, because it will turn out that even without mining, GDDR6X is just hot as hell.
There are no trade secrets that we are interested in, Nvidia. We are trying to help you reduce reputational damage and warranty claims.
Everyone knows max spec is 110C. We are just looking for an accurate way to monitor temperature so we can take proactive action on Linux-based systems. Simple GPU temp reporting doesn't tell you what the VRAM temperature is doing.
Stop posting BS and make me feel like a real customer who bought a $1000 piece of crap from you.
@nadeemm will explain more about what memory temperatures are exposed and why.
When can we expect an update from @nadeemm?
To the extent that Nvidia is mulling over whether it's worth adding the feature or not, I'd like to vote my opinion that most definitely yes, it is. As a system builder, I use Nvidia RTX cards and I'm trying to optimize the airflow in order to keep the cards cool. It's impractical to do A/B testing with case airflow configurations when I can't get a reading of the memory temps. Those running Windows do get the temps, but it's no less important on Linux. Given that the memory temps are known to run hot, it's kind of important that we have this information when tweaking the case fans rather than doing so blind and just hoping for the best.
Any update? Really could do with that information. All Nvidia has said is that Windows can't do it either. It can?
Agreed!Agreed!Agreed! Need it!Need it!Need it!
So sad to see that despite these comments, nothing has been done by NVIDIA since February. :((
+1. I am a machine learning student and I spent $1000 on my RTX 3080. When I train my deep learning model, the RTX 3080 gets extremely hot and my PC often shuts down. From what I found on the internet, the problem is the memory temperature of the GDDR6X. I hope Nvidia updates the driver so that we can monitor memory junction temperature.
I guess it is simple: Nvidia is just hiding something from us, and this "something" could really harm their reputation and is somehow connected to this damn temp data. And they will just say/do anything to keep this "something" hidden. So I guess it is pointless to ask them to reveal that temp data. Their aCCes are more dear to them than our aCCes, that's it…
Try that command with a GPU that has GDDR6X.
We already know that you cannot monitor this temperature for GDDR6 in Windows or Linux, but throttling does not seem to be an issue with this memory type.
The issue is with GDDR6X with which the temperature can be monitored in windows but NOT in Linux.
It has now been almost eight months since this topic was created and five months since @wpierce responded with "We are currently tracking it under internal bug number 3269484." and "It was filed some time ago and is being prioritized. I hear your concerns and am making sure it gets addressed."
Memory junction temperature for GDDR6X is exposed by the Windows version of the Nvidia driver. How come it's seemingly impossible to achieve the same in the Linux driver?
Can you please have the Linux driver team talk to the Windows driver team and have this fixed?