Still waiting.
I think they just don’t care about it. AMD doesn’t have problems like that. I’ll just wait for their new GPU and say bye to NVIDIA.
Which is very odd since nVidia likes to brag about what new supercomputer they helped build, or new contracts with different institutions and academia, all of which must run Linux! I wonder how those are faring with nVidia support for linux?! No wonder Apple ditched nVidia for AMD, considering their lazy driver updates and also of course the dreaded hardware flops.
+1
@wpierce I will gladly work on this feature for free, sign NDAs, etc. Let’s just get this done. Don’t know what stage you’re in, but I work with everything from design and development to testing. It shouldn’t be this difficult. Non-critical feature, just release something and improve it over time. We’re talking about displaying a single float, come on.
All - We appreciate your comments and suggestions.
The GPU actively monitors and manages many thermal parameters. We work with memory vendors to ensure that their operating specifications are not only met by design but also thoroughly tested in extreme conditions to verify compliance.
For those of you who are interested, Micron’s GDDR6X specifications can be found here:
The memory case temperature is not exposed by any third-party tools authorized by NVIDIA on Windows or Linux. Existing third-party tools appear to be reporting numbers that do not represent the relevant case temperature (Tc) specification and it’s normal for other readings to show higher values.
While we don’t currently have plans to expose this information, we appreciate your ideas and suggestions. Please keep them coming.
We have documented some of the factors which affect clock throttling on page 11 of the NVIDIA SMI manual.
To see the manual, on a system with smi installed, use the command: man nvidia-smi
Alternatively this is the last one we posted online: https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf
NVIDIA SMI uses NVIDIA NVML APIs which are documented here: NVML API Reference Guide :: GPU Deployment and Management Documentation
Appreciate your reply. But basically you are saying we don’t have to wait for this. It won’t come.
For a card that is so sensitive to its memory temperature, it would have been really nice to have this feature. So maybe before designing the next cards, keep in mind to expose the relevant temperatures, or add a simple temperature sensor if you can’t get the reading otherwise?
We can get the reason for the throttling via NvmlClocksThrottleReasons, but we want to run the cards in a safe manner, and it is not easy to fix these thermal problems without a number to verify what effect changes have, especially if you have large GPU clusters for ML.
The temperature reported by third-party tools on Windows at least gives us a ballpark figure for the memory temperature and lets us see whether cooling changes have a positive effect. We did optimise some cooling this way. But for the larger clusters with hundreds of 3090s, it is just not feasible to install Windows to debug the cooling. So it would really help to get even the "numbers that do not represent the relevant case temperature (Tc) specification", as that is better than nothing, given that your design doesn’t make the relevant information accessible.
So I hope NVIDIA reconsiders its decision and at least adds the same temperature information we can access on Windows to the Linux driver.
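For anyone who wants to act on the throttle reasons mentioned above while the memory temperature is unavailable, here is a minimal sketch of decoding the throttle-reason bitmask into readable names. The bit values are the nvmlClocksThrottleReason* constants as I understand them from nvml.h, so please verify them against your own header. The function names and the short labels are my own and purely illustrative. It only decodes a mask you already have (for example from the pynvml bindings, which is an assumption about your setup), and it does not talk to the GPU itself.

```python
# Bit values believed to match the nvmlClocksThrottleReason* constants in
# nvml.h. Verify against your installed header before relying on them.
# The short labels on the right are illustrative names, not NVML identifiers.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Bits that indicate a temperature-driven slowdown.
THERMAL_BITS = 0x020 | 0x040


def decode_throttle_reasons(mask):
    """Return the labels of all throttle-reason bits set in `mask`."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def is_thermal_throttle(mask):
    """True if any thermal slowdown bit is set in `mask`."""
    return bool(mask & THERMAL_BITS)


if __name__ == "__main__":
    example_mask = 0x004 | 0x020  # hypothetical reading: power cap + SW thermal
    print(decode_throttle_reasons(example_mask))
    print(is_thermal_throttle(example_mask))
```

This tells you that a card is throttling for thermal reasons, but not which sensor triggered it or by how much, which is exactly why a number for the memory temperature would still be needed.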
@wpierce you said before that you were working on it and that it would come soon. Everyone has been waiting patiently for months. Now you say you won’t expose it on Windows or Linux. But then how can we see the memory temperature on Windows?
I hope NVIDIA changes that stupid decision. Is it just a decision, or are there technical problems with exposing the temperatures?
Where are the third-party tools getting the VRAM temperature readings from? Basically we just want that info, but on Linux. Those readings must come from somewhere.
With all due respect, we didn’t ask for Tc (case temperature); we asked for TjMax, the maximum junction temperature, meaning the hottest of the hot spots. From experience with a 3090 FE, that’s usually the memory chip on the lower part of the GPU’s back (near the PCIe connector), since that chip sits lowest and gets the least airflow or heat dissipation from the passive backplate.
The Windows monitoring tool HWiNFO states:
This is the internal junction temperature measured inside the silicon, NOT the external (case) temperature!!! As such higher values (than usual Tcase) are expected here. Thermal throttling starts around 110C
In conclusion, please revisit the request from this thread, as none of us here care about Tcase. If you don’t plan to expose that information, as you say, that’s fine with us consumers. We only care about TjMax, which is != Tcase, so that we can avoid thermal throttling and take better care of the hardware you provided.
If you would like to get some more insight into the NVML throttling - please spin up a new topic, and if possible ask any specific questions you have in mind - it helps me to get responses from the right folks.
Thanks so much !
Exactly!
So no MEM temperature under linux?
This is a F-ing joke nVidia…
+1 this is exactly what is needed; they are, however, not doing it for some reason…
@nadeemm We don’t care about NVML, we care about TjMax being visible in nvidia-smi
Get it through your head already, it’s not difficult …
I don’t think you are grasping the issue:
Running “nvidia-smi dmon” works on Windows.
On Linux it shows mtemp empty. Try it, please.
Really, many of us have been checking in on this frequently since February.
Hi NVidia.
Not sure if you are understanding the situation. It’s pretty simple.
The VRAM junction temp, the hottest part of the 3090, needs to be monitored.
Surely, being that you have tested your product prior to shipping (right?), you are acutely aware of this.
I won’t comment on your usage of poor thermal pads. Nope, won’t do it.
HWiNFO can do this on Windows, despite not being totally accurate.
We are perfectly happy with it not being as accurate as a dedicated sensor. It’s far more intel than what is otherwise provided (none).
No one anywhere in this thread ever mentioned or cared about case temp. That is a deviation to the unimaginable n’th degree.
Implement a NVAPI update to provision said measurement for linux. Simple. Thanks.
Go for AMD, they don’t have problems with Linux. Bye
Hi,
We’re not asking for the case temperature as people have pointed out.
Here’s the output of nvidia-smi dmon -s pucvme
Note the mtemp column is empty on Linux. On Windows it displays the memory junction temperature.
That is the temperature we are after.
bertha ➜ ~ nvidia-smi dmon -s pucvme
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk pviol tviol    fb  bar1 sbecc dbecc   pci
# Idx     W     C     C     %     %     %     %   MHz   MHz     %  bool    MB    MB  errs  errs  errs
    0   229    62     -   100   100     0     0 10451  1575     0     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
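If you want to check this across many machines, here is a rough sketch of a script that parses dmon text and reports whether mtemp is populated. It assumes the usual layout where the two header lines start with "#" and the first one begins with "gpu"; the function names are my own, and the sample text is the first two rows from my output above. It works on text you have already captured and does not call nvidia-smi itself.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text into a list of dicts keyed by column name."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The names header starts with "gpu"; the units header ("Idx ...")
            # and any repeated headers are skipped.
            if columns is None and fields and fields[0] == "gpu":
                columns = fields
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue  # not a data row we can interpret
        rows.append(dict(zip(columns, values)))
    return rows


def memory_temps(rows):
    """Return mtemp per sample as int, or None where dmon printed '-'."""
    return [None if r.get("mtemp", "-") == "-" else int(r["mtemp"]) for r in rows]


SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk pviol tviol    fb  bar1 sbecc dbecc   pci
# Idx     W     C     C     %     %     %     %   MHz   MHz     %  bool    MB    MB  errs  errs  errs
    0   229    62     -   100   100     0     0 10451  1575     0     0  4845     9     -     -     0
    0   229    62     -   100   100     0     0 10451  1560   100     0  4845     9     -     -     0
"""

if __name__ == "__main__":
    samples = parse_dmon(SAMPLE)
    temps = memory_temps(samples)
    if all(t is None for t in temps):
        print("mtemp is empty in every sample")
    else:
        print("mtemp readings:", temps)
```

On my Linux box every sample comes back as None, which is the whole problem.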