Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

ddobrev85 · October 2, 2021, 4:23pm

@wpierce how come the winblows driver has all the good stuff, and the linux users get a cut-off version of the drivers with reduced features? Are we something less than the winblows users or what?!
We’d like to be able to monitor mem temperatures, do cool things like OC mem/voltage control - things that are available to winblows users for ages…

To the driver developers ----- Get your heads out of your lazy asses and start doing your job. Aight?

akhtaranique · October 4, 2021, 7:33am

@nadeemm Still waiting on a response from NVIDIA. We really, really do need the temperature readings on Linux. I do not want to install Windows just to be able to get temperature readings. Even an unofficial trick for obtaining them would be really helpful. If it is not possible because the memory vendors are not giving accurate specifications for Linux, then just let us know. It would be much more helpful if you guys were straightforward and honest about all of this so we can find an alternative solution.

Thanks.

paolo.forum · October 4, 2021, 9:21am

hellooo? any news here?

nadeemm · October 4, 2021, 5:58pm

@thanhson1211130
The NVIDIA Driver monitors all the key temperatures and throttles as necessary - this should not result in your PC shutting down. Please spin up a new Topic and describe your PC setup, a little more about your training setup, what triggers the issue, etc. Let's approach this with an open mind - all the system components need to work, and a heavily loaded GPU is going to push the whole system. There are tons of smart and experienced folks on these forums, and I can get help from NV engineers as needed too.

If after brainstorming and debugging we find that there is something we can improve to avoid this problem, I can take that info and ensure it gets attention.

I want every NVIDIA user to be able to mine or train on their GPUs without random crashes like this, and millions do, which gives me confidence we’ll figure it out.
Thanks

nadeemm · October 4, 2021, 6:08pm

@ddobrev85
The GPU data being provided via NVIDIA’s applications and tools is the same under linux and windows. If you see or know of an API response or output from nvidia-smi which is different for a given GPU between Linux and Windows, then please spin up a new topic and call it out. I will personally take that data point and escalate it.

I can not account for or speak for third party tools, only NVIDIA documented APIs and provided tools.
Thanks

nadeemm · October 4, 2021, 6:16pm

@akhtaranique
Please help me - help you.
Please describe the challenge or problem you are facing for which you have determined it's essential to know the VRAM temperature. I’m not trying to be clever or avoid your ask; I know you must have a reason. Start a new topic, describe the task you are running (it doesn’t matter if it's mining, training, running games, anything) and the problem or challenge, and describe how you were planning to use the VRAM temp to solve that problem.
Give me and others in this community an opportunity to share our experiences and ideas on how to solve this problem and if we find the only or best way is to know the VRAM temp, so be it - I walk away with a real use case.

ddobrev85 · October 4, 2021, 6:56pm

Really, they are the same? Then how come we can’t control core & mem voltages under linux?
Care to show the appropriate APIs for that for linux?

And as for mem temperature - why not expose it, people want to see their mem temperatures. AMD doesn’t hide these things for linux users, but your company does.

Why is that?

buran.energia · October 5, 2021, 9:53am

What seems to be the problem is this (based on my 10 mins research): We get the VRAM temperature info for 30xx series from tools like HWInfo, some miner apps such as gminer and t-rex, etc., because they use an undocumented NVidia API to get it. This API seems to be only available in Windows CUDA drivers and I think is called “NvAPI_GPU_GetAllTempsEx”. This would explain why the same, for example, mining apps don’t show the VRAM temperature in Linux. Just make this hidden function available in Linux drivers.
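
For reference, here is a minimal sketch of what the documented route looks like on Linux, so the gap is easy to reproduce. It assumes `nvidia-smi` is on the PATH; `temperature.gpu` and `temperature.memory` are query fields listed by `nvidia-smi --help-query-gpu`, while the helper names `parse_temps` and `read_temps` are made up for this example. On cards where the driver does not report a memory temperature, the field comes back as a non-numeric placeholder such as N/A, which the parser maps to `None`.

```python
import subprocess

# Documented nvidia-smi query fields; temperature.memory is only
# populated on GPUs where the driver exposes a memory sensor.
QUERY_FIELDS = ["name", "temperature.gpu", "temperature.memory"]


def _to_int(value):
    """Return an int, or None for placeholders such as 'N/A'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_temps(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so the GPU name is kept in one piece.
        name, core, mem = (part.strip() for part in line.rsplit(",", 2))
        rows.append({"name": name, "core_c": _to_int(core), "mem_c": _to_int(mem)})
    return rows


def read_temps():
    """Hypothetical helper: run nvidia-smi and parse its output."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_temps(out)
```

Calling `read_temps()` on a machine with the driver installed returns one dict per GPU; a `mem_c` of `None` means the documented interface reported no memory temperature for that card.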

iterium · October 5, 2021, 12:41pm

There’s a huge community of Linux users, not just miners. It’s important to know VRAM temps (for linux users):

After you’ve done basic maintenance on your card (changed thermal pads/paste), the only way to know if everything is OK is to look at memory/core temps. There’s no public spec available for the thermal pads and a large variety of pads is available, so there’s no 100% guarantee that the installed pads will have the best possible contact (or quality), or that the amount of thermal paste doesn’t prevent good contact. Wrong pads will shorten the card's life or even make it unusable. Agreed?
Training AI models on a GPU generates a lot of heat, and algorithms are designed to work on consumer GPUs (not just in datacentres) or on edge devices. In order to tweak the algorithm you need to know usage/temps for core/memory. Usage should be balanced (at least in our use case) to work correctly on the edge.

nadeemm · October 5, 2021, 5:07pm

If you really want to know if it's possible to control core & mem voltages in windows and linux - then please spin up a new topic with that ask, and I can try to get the answer. It would help even more if you could add what you are trying to achieve by tweaking these, as there may be alternative ways.
If the result is we have it for windows and not for Linux, then I can escalate.
I am not asking too much - just asking you to spin up a new topic with a focused question.
If you have found a document or spec which states this is possible in windows and not in linux - please share it.
Thanks

nadeemm · October 5, 2021, 5:13pm

Thanks iterium,
I do appreciate you writing up these two objectives, and I do expect these are the two most common use cases.

So please spin up two new topics - one for how to best verify if your cooling upgrade is working and the other asking for guidance on how to tune your GPU for best performance for extended training tasks.

Thanks!

nadeemm · October 5, 2021, 5:25pm

Hi Buran, and welcome to NVIDIA Developer forums !
You have the essence of this discussion:

NVIDIA has not officially provided VRAM temperature on 30 series in either Windows or Linux via any documented method. My ask for the community is to spin up new Topics with specific use cases/challenges - frankly in Windows or Linux, and if we find the only way to achieve success is to get the VRAM temp, then I can take those examples and have those discussions within NVIDIA.
NVIDIA drivers dynamically monitor multiple components and throttle to ensure everything is working within specs.
Some third party companies are using some unapproved method for getting something which they call VRAM temperature, and these third parties have not made this available in Linux.

Thanks again for joining in,
Nadeem

ddobrev85 · October 5, 2021, 8:45pm

@nadeemm You are arguing that we need to “achieve” something in order for you to escalate this. You talk about “alternate” ways, which are not alternate, not really…

I’d argue this - DOES WINDOWS REALLY need:

to be able to see memory temperature (and I don’t care how the various monitoring tools are able to get it - they are getting it from somewhere)
to be able to control core/memory voltages (again - various tools are available to do this), and no, I’m not talking about the crude “power stage” adjustment available on the linux driver. It sucks and you can’t compare it to the actual control available on windows

Does windows REALLY need all these things, then?
Do you provide “special” APIs for manufacturers to do fine-grained OC with control for all - using their own tools (MSI Afterburner, for example, there’s plenty of others)?

Why does windows get such a special treatment?

Answer these, if you can.
And stop telling people that “if there is no other way to do what you need, then I will escalate”.
We don’t care if there is an “other” way, an “alternative” or whatever. We want the EXACT same things available on the windows driver. Period.

And here’s your problem, personally - the community will keep demanding these things, and you will have to keep copypaste-answering until either you quit your job, or more people come to assist you, wasting their time. Keep that in mind - there are more of us than of you.

Sorry for multiple edits, I was quite angry last night… Linux gaming is on the rise, a lot of steam games are now playable on linux, yet nvidia doesn’t care about us.

buran.energia · October 6, 2021, 10:04am

Thank you for the welcome. The thing is, as far as I know, you won’t find a single tool that would show the alleged VRAM temperature for 30xx on Linux like they do on Windows. Not a single one. It is improbable that all those authors with Windows and Linux versions of their tools have decided not to implement this sought-after feature for 30xx cards on Linux. The only explanation is that the same method doesn’t work on Linux.

I can provide you with the Python source code of one such Windows tool that uses NVAPI to show the alleged VRAM temperature, so you can see how it works or show it to your colleagues and then make the same data somehow available on Linux?

ddobrev85 · October 6, 2021, 11:14am

Naaaaah, they won’t do that, because this will mean they have to “implement” it on the linux driver/api. That’s just something they won’t do - it will cost them money … to add something they already have :)

And… the mods over here are just copypasters - they copy and paste the predefined replies their masters give them from above…

mikeyx86 · October 7, 2021, 9:38am

NVIDIA drivers dynamically monitor multiple components and throttle to ensure everything is working within specs

Well, here’s one specific use case that shows how that’s a problem – I have multiple 30xx’s running under Linux. I do not have a viable Windows version for the use case in question. I was having a problem with one particular card underperforming by 5-15%. By all measures, it appeared to be thermally throttling, but I was not able to access the VRAM thermals to verify this hypothesis. All other temps were within specs and behaving normally. I had to proceed on the assumption that there was an issue with the thermal padding or with the silicon itself. Again, I was not able to access the temps on Linux to verify this, and running under Windows was not an option for my setup. As a result, I ran with this card in a hobbled state for about 4 months, making adjustments as best I could to get the performance up to “only” 5-10% below all my other cards. I discovered within the last 3 days that the issue was actually a defective cable (I replaced it because I reconfigured my setup, and had to add an extended-length cable for issues completely unrelated to this performance issue). Had I been able to rule out VRAM-related throttling, I would have easily discovered this issue months ago. But because I was not able to check the VRAM temp, I could not verify the proper performance of this critical component on this card, and I had to fly blind. This cost me 4 months of degraded performance.

Please fix this issue so this will not happen to other customers like me in the future.
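
For anyone in a similar spot, one partial check that is available today is the set of throttle-reason flags that `nvidia-smi` reports. The sketch below is only an illustration under assumptions: the `clocks_throttle_reasons.*` field names are the ones listed by `nvidia-smi --help-query-gpu` on drivers from around this time (newer drivers may name them differently), the helper names are hypothetical, and nothing in this thread confirms whether memory-temperature throttling shows up under these flags. It does not replace an actual VRAM temperature reading.

```python
import subprocess

# Throttle-reason query fields (assumed names, check --help-query-gpu
# on your driver). nvidia-smi prints "Active" or "Not Active" for each.
THROTTLE_FIELDS = [
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]


def active_reasons(csv_line):
    """Return the field names whose value is exactly 'Active'."""
    values = [value.strip() for value in csv_line.split(",")]
    return [
        field
        for field, value in zip(THROTTLE_FIELDS, values)
        if value == "Active"
    ]


def query_throttle_reasons(gpu_index=0):
    """Hypothetical helper: ask nvidia-smi about one GPU by index."""
    cmd = [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(THROTTLE_FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return active_reasons(out.strip())
```

Polling `query_throttle_reasons(i)` for each card under load and comparing the slow card against the healthy ones would at least show whether the driver itself reports a thermal or power reason for the lower clocks.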

blake.hooper · October 8, 2021, 12:33am

Another use case I think is relevant is helping identify manufacturing errors in thermal pad placement. There are plenty of examples on the internet where top-level cards (3080, 3080 Ti) have been incorrectly built.
Here’s a small write-up from Tom’s Hardware:

nvidia1094 · October 8, 2021, 3:59pm

While we don’t currently have plans to expose this information, we appreciate your ideas and suggestions. Please keep them coming.
…
If the result is we have it for windows and not for Linux, then I can escalate.

It’s hard to solve temperature related bottlenecks when we don’t have all the temps on linux. Monitoring all temps is crucial for diagnostics and tuning, please don’t downplay the significance of heat and throttling to an audience of solution developers who need to have reliable monitoring. Even casual windows users already have this tooling, but here on linux we don’t. At the very least, Nvidia should be able to acknowledge the absence of this monitoring on linux is a real deficiency for linux developers. Please do escalate ASAP.

nadeemm · October 8, 2021, 8:21pm

Thanks - I will definitely use this as ammo.
For interest - what cable did you replace - was it the GPU Power cable ?

nadeemm · October 8, 2021, 8:24pm

Thanks for sharing - and this one use case is firmly on my list.
