Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

It has now been almost eight months since this topic was created and five months since @wpierce responded with “We are currently tracking it under internal bug number 3269484.” and “It was filed some time ago and is being prioritized. I hear your concerns and am making sure it gets addressed.”

Memory junction temperature for GDDR6X is exposed by the Windows version of the NVIDIA driver. How come it’s seemingly impossible to achieve the same in the Linux driver?

Can you please have the Linux driver team talk to the Windows driver team and have this fixed?

@wpierce how come the winblows driver has all the good stuff, and the linux users get a cut-off version of the drivers with reduced features? Are we something less than the winblows users or what?!
We’d like to be able to monitor mem temperatures, do cool things like OC mem/voltage control - things that have been available to winblows users for ages…

To the driver developers ----- Get your heads out of your lazy asses and start doing your job. Aight?

@nadeemm Still waiting on a response from NVIDIA. We really, really do need the temperature readings on Linux. I do not want to install Windows just to be able to get temperature readings. Even an unofficial trick on how to obtain it would be really helpful. If it is not possible because the memory vendors are not giving accurate specifications for Linux, then just let us know. It would be much more helpful if you guys were straightforward and honest about all of this so we can find an alternative solution.

Thanks.

hellooo? any news here?

@thanhson1211130
The NVIDIA driver monitors all the key temperatures and throttles as necessary - this should not result in your PC shutting down. Please spin up a new topic and describe your PC setup, a little bit more about your training setup, what triggers the issue, etc. Let’s approach this with an open mind - all the system components need to work, and a heavily loaded GPU is going to push the whole system. There are tons of smart and experienced folks on these forums, and I can get help from NV engineers as needed too.

If after brainstorming and debugging we find that there is something we can improve to avoid this problem, I can take that info and ensure it gets attention.

I want every NVIDIA user to be able to mine or train on their GPUs without random crashes like this, and millions do, which gives me confidence we’ll figure it out.
Thanks

@ddobrev85
The GPU data being provided via NVIDIA’s applications and tools is the same under Linux and Windows. If you see or know of an API response or output from nvidia-smi which is different for a given GPU between Linux and Windows, then please spin up a new topic and call it out. I will personally take that data point and escalate it.

I cannot account for or speak for third-party tools, only NVIDIA-documented APIs and provided tools.
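For anyone who wants to produce that kind of comparison, here is a minimal sketch, not an official tool. It assumes `nvidia-smi` is on the PATH and uses the documented `--query-gpu` fields `temperature.gpu` and `temperature.memory`. The helper names and the sample line are hypothetical, and the exact placeholder text printed for an unsupported field may differ between driver versions.

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`. "temperature.memory" is
# documented as the HBM memory temperature; on boards where the driver does
# not expose it, the column holds a placeholder instead of a number.
FIELDS = ["name", "temperature.gpu", "temperature.memory"]


def parse_query_output(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    Non-numeric temperature cells (placeholders such as N/A) become None.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue
        row = dict(zip(FIELDS, (cell.strip() for cell in record)))
        for key in FIELDS[1:]:
            value = row[key]
            row[key] = int(value) if value.isdigit() else None
        rows.append(row)
    return rows


def query_temperatures():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_query_output(result.stdout)


# Hypothetical sample line, only to show the parsing; not captured output.
SAMPLE = "NVIDIA GeForce RTX 3080, 67, [N/A]\n"
print(parse_query_output(SAMPLE))
```

Running `query_temperatures()` on the same card under both operating systems and posting the two results side by side would give a concrete data point for a new topic.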
Thanks

@akhtaranique
Please help me - help you.
Please describe the challenge or problem you are facing for which you have determined it’s essential to know the VRAM temperature. I’m not trying to be clever or avoid your ask; I know you must have a reason. Start a new topic and describe the task you are running (it doesn’t matter if it’s mining, training, running games, anything), the problem or challenge, and how you were planning to use the VRAM temp to solve that problem.
Give me and others in this community an opportunity to share our experiences and ideas on how to solve this problem and if we find the only or best way is to know the VRAM temp, so be it - I walk away with a real use case.

Really, they are the same? Then how come we can’t control core & mem voltages under linux?
Care to show the appropriate APIs for that for linux?

And as for mem temperature - why not expose it, people want to see their mem temperatures. AMD doesn’t hide these things for linux users, but your company does.

Why is that?

What seems to be the problem is this (based on my 10 minutes of research): We get the VRAM temperature info for the 30xx series from tools like HWInfo, some miner apps such as gminer and t-rex, etc., because they use an undocumented NVIDIA API to get it. This API seems to be available only in the Windows CUDA drivers and I think it is called “NvAPI_GPU_GetAllTempsEx”. This would explain why the same mining apps, for example, don’t show the VRAM temperature on Linux. Just make this hidden function available in the Linux drivers.

There’s a huge community of Linux users, not just miners. It’s important for Linux users to know VRAM temps:

  1. After you’ve done basic maintenance on your card (changed thermal pads/paste), the only way to know whether everything is OK is to look at memory/core temps. There’s no public spec available for the thermal pads, and with the large variety of pads available there’s no guarantee that the installed pads have the best possible contact (or quality), or that the amount of thermal paste doesn’t prevent good contact. The wrong pads will shorten the card’s life or even make it unusable. Agreed?

  2. Training AI models on a GPU generates a lot of heat, and the algorithms are designed to work on consumer GPUs (not just in datacentres) or on edge devices. In order to tweak an algorithm you need to know usage/temps for core/memory. Usage should be balanced (at least in our use case) to work correctly on the edge.

If you really want to know if it’s possible to control core & mem voltages in Windows and Linux - then please spin up a new topic with that ask, and I can try to get the answer. It would help even more if you could add what you are trying to achieve by tweaking these, as there may be alternative ways.
If the result is we have it for Windows and not for Linux, then I can escalate.
I am not asking too much - just that you spin up a new topic with a focused question.
If you have found a document or spec which states this is possible in windows and not in linux - please share it.
Thanks

Thanks iterium,
I do appreciate you writing up these two objectives, and I do expect these are the two most common use cases.

So please spin up two new topics - one for how to best verify if your cooling upgrade is working and the other asking for guidance on how to tune your GPU for best performance for extended training tasks.

Thanks!

Hi Buran, and welcome to the NVIDIA Developer forums!
You have the essence of this discussion:

  1. NVIDIA has not officially provided VRAM temperature on the 30 series in either Windows or Linux via any documented method. My ask for the community is to spin up new topics with specific use cases/challenges - frankly, in Windows or Linux - and if we find the only way to achieve success is to get the VRAM temp, then I can take those examples and have those discussions within NVIDIA.
  2. NVIDIA drivers dynamically monitor multiple components and throttle to ensure everything is working within specs.
  3. Some third-party companies are using some unapproved method for getting something which they call VRAM temperature, and these third parties have not made this available in Linux.

Thanks again for joining in,
Nadeem

@nadeemm You are arguing that we need to “achieve” something in order for you to escalate this. You talk about “alternate” ways, which are not alternate, not really…

I’d argue this - DOES WINDOWS REALLY need:

  • to be able to see memory temperature (and I don’t care how the various monitoring tools are able to get it - they are getting it from somewhere)
  • to be able to control core/memory voltages (again - various tools are available to do this), and no, I’m not talking about the crude “power stage” adjustment available on the linux driver. It sucks and you can’t compare it to the actual control available on windows

Does windows REALLY need all these things, then?
Do you provide “special” APIs for manufacturers to do fine-grained OC with control for all - using their own tools (MSI Afterburner, for example; there are plenty of others)?

Why does windows get such a special treatment?

Answer these, if you can.
And stop telling people that “if there is no other way to do what you need, then I will escalate”.
We don’t care if there is an “other” way, an “alternative” or whatever. We want the EXACT same things available on the windows driver. Period.

And here’s your problem, personally - the community will keep demanding these things, and you will have to keep copypaste-answering until either you quit your job, or more people come to assist you, wasting their time. Keep that in mind - there are more of us than of you.

Sorry for multiple edits, I was quite angry last night… Linux gaming is on the rise, a lot of steam games are now playable on linux, yet nvidia doesn’t care about us.

Thank you for the welcome. The thing is, as far as I know, you won’t find a single tool that shows the alleged VRAM temperature for 30xx on Linux the way they do on Windows. Not a single one. It is improbable that all those authors with Windows and Linux versions of their tools have decided not to implement this sought-after feature for 30xx cards on Linux. The only explanation is that the same method doesn’t work on Linux.

I can provide you with the Python source code of one such Windows tool that uses NVAPI to show the alleged VRAM temperature, so you can see how it works or show it to your colleagues and then somehow make the same data available on Linux?

Naaaaah, they won’t do that, because this will mean they have to “implement” it on the linux driver/api. That’s just something they won’t do - it will cost them money … to add something they already have :)

And… the mods over here are just copypasters - they copy and paste the predefined replies their masters give them from above…

“NVIDIA drivers dynamically monitor multiple components and throttle to ensure everything is working within specs”

Well, here’s one specific use case that shows how that’s a problem – I have multiple 30xx’s running under Linux. I do not have a viable Windows version for the use case in question. I was having a problem with one particular card underperforming by 5-15%. By all measures, it appeared to be thermally throttling, but I was not able to access the VRAM thermals to verify this hypothesis. All other temps were within specs and behaving normally. I had to proceed on the assumption that there was an issue with the thermal padding or with the silicon itself. Again, I was not able to access the temps on Linux to verify this, and running under Windows was not an option for my setup. As a result, I ran with this card in a hobbled state for about 4 months, making adjustments as best I could to get the performance up to “only” 5-10% below all my other cards. I discovered within the last 3 days that the issue was actually a defective cable (I replaced it because I reconfigured my setup, and had to add an extended-length cable for issues completely unrelated to this performance issue). Had I been able to rule out VRAM-related throttling, I would have easily discovered this issue months ago. But because I was not able to check the VRAM temp, I could not verify the proper performance of this critical component on this card, and I had to fly blind. This cost me 4 months of degraded performance.

Please fix this issue so this will not happen to other customers like me in the future.
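For anyone in a similar spot, one partial workaround is to ask the driver why it is holding clocks down. The sketch below decodes the bitmask returned by NVML’s `nvmlDeviceGetCurrentClocksThrottleReasons`, using the bit values from `nvml.h`. It assumes the `nvidia-ml-py` (`pynvml`) package is installed for the live read, and the short reason names are my own labels. The mask does not say which sensor triggered a thermal slowdown, so this narrows the problem down but is no substitute for a VRAM temperature reading.

```python
# Bit values of the nvmlClocksThrottleReason* constants in nvml.h.
# The short names on the right are illustrative labels, not NVML identifiers.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask):
    """Return the labels of all throttle-reason bits set in `mask`."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def read_throttle_reasons(index=0):
    """Read and decode the current throttle reasons for one GPU.

    Assumes the pynvml bindings and an NVIDIA driver are present.
    """
    import pynvml  # imported lazily so the decoder works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    finally:
        pynvml.nvmlShutdown()
    return decode_throttle_reasons(mask)


# Example with a made-up mask: power cap and software thermal slowdown set.
print(decode_throttle_reasons(0x004 | 0x020))
```

If a slow card reports no thermal or power reasons while under load, that points away from throttling and towards something else in the system, such as cabling.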

Another use case I think is relevant is helping identify manufacturing errors in thermal pad placement. There are plenty of examples on the internet where top-level cards (3080, 3080 Ti) have been incorrectly built.
Here’s a small write-up from Tom’s Hardware:

While we don’t currently have plans to expose this information, we appreciate your ideas and suggestions. Please keep them coming.

“If the result is we have it for windows and not for Linux, then I can escalate.”

It’s hard to solve temperature-related bottlenecks when we don’t have all the temps on Linux. Monitoring all temps is crucial for diagnostics and tuning. Please don’t downplay the significance of heat and throttling to an audience of solution developers who need to have reliable monitoring. Even casual Windows users already have this tooling, but here on Linux we don’t. At the very least, NVIDIA should be able to acknowledge that the absence of this monitoring on Linux is a real deficiency for Linux developers. Please do escalate ASAP.

Thanks - I will definitely use this as ammo.
For interest - what cable did you replace? Was it the GPU power cable?