Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

There is a new attempt over here:

If you can get past the SIGSEGV error I get, maybe it’ll work for you. My kernel has

CONFIG_IO_STRICT_DEVMEM=y

so likely that’s why

iomem=relaxed

isn’t working for my system. Hopefully we get a solution soon, because I’ve already fried one 3090 MOSFET under warranty and want to fit a waterblock, but I have no way of checking memory temperatures in Linux and don’t have access to Windows. If I wasn’t doing AI/ML serving, I would have moved to another option already. But apparently the largest base of AI/ML is Linux, the best cards are NVIDIA’s, and Jensen keeps saying he loves us, but… my RTX is getting fried and I can’t see the temps.


I can confirm that the above code now works; I can see my VRAM temps in Linux!

You couldn’t with the NVML API?

Nope. No reference link for the NVML API was provided, and googling the topic gave pretty terrible non-instructions and hairballs. The GitHub repo I pasted has easy-to-follow instructions and just works, nice and clean.

You supposedly have to pass a value of 15 as the index. No, it doesn’t make any sense.

Would be interested to know if that works for you.

More than happy to try it out for you. And maybe I’m just not that well versed, but what exactly do I do with that? How do I call that API? Is there a CLI command, or do I write something in a C/Python script? I’m lost… to poorly paraphrase Bones, I’m a physicist, not a CS guy.

At least on my 4090, nvmlDeviceGetThermalSettings returns four sets of settings at indexes 0, 1, 2, 15, but all are identified as GPU for both the controller and target, and return the same temp value. So the direct PCI access method in gddr6 remains the only mechanism that has demonstrably worked.
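
For anyone wondering how to actually call that from code (per the question above), a minimal C sketch against the NVML API might look roughly like this. It assumes GPU index 0, uses the struct layout as declared in the nvml.h that ships with the CUDA toolkit / nvidia-settings source, and keeps error handling to a minimum:

/* thermal.c - probe nvmlDeviceGetThermalSettings at sensor indexes 0, 1, 2, 15 */
/* build (paths may vary): gcc thermal.c -o thermal -I/usr/local/cuda/include -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int indexes[] = {0, 1, 2, 15};   /* 15 is NVML_THERMAL_TARGET_ALL in nvml.h */

    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "could not get a handle for GPU 0\n");
        nvmlShutdown();
        return 1;
    }

    for (unsigned int i = 0; i < sizeof(indexes) / sizeof(indexes[0]); i++) {
        nvmlGpuThermalSettings_t ts;
        nvmlReturn_t rc = nvmlDeviceGetThermalSettings(dev, indexes[i], &ts);
        if (rc != NVML_SUCCESS) {
            printf("index %u: %s\n", indexes[i], nvmlErrorString(rc));
            continue;
        }
        for (unsigned int s = 0; s < ts.count; s++) {
            printf("index %u, sensor %u: controller=%d target=%d currentTemp=%d C\n",
                   indexes[i], s,
                   (int)ts.sensor[s].controller,
                   (int)ts.sensor[s].target,
                   ts.sensor[s].currentTemp);
        }
    }

    nvmlShutdown();
    return 0;
}

On my 4090 this just prints the same GPU core temperature for every index, as described above.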

Thanks for the confirmation. The function really makes no sense whatsoever.

Is it possible that high junction temps would cause nvidia-smi -q -d PERFORMANCE to report thermal throttling as active on a 16xx series card? How does that even work if the temps aren’t readable?

Currently in the middle of a lengthy and annoying troubleshooting process since all visible temps on my card are 30C below where throttling should occur.

It is October 2023 and NVIDIA still isn’t providing a way to read VRAM temps on an RTX 3090 via nvidia-smi. (I haven’t actually tested the third-party open-source tool yet.)

Unless the situation has changed in the meantime and I’m not aware of it, this behaviour from NVIDIA makes it 100% clear what they are doing.

They are using the argument of “so you want to mine crypto on consumer GPUs” to explain away the lack of this crucial feature, while we all know what the truth is.

NVIDIA doesn’t want a random person who wants to do ML research as a hobby to buy a couple of old RTX 3090s; they want that person to sign up for AWS, Azure, or the (not-at-all) OpenAI and pay 10x more for the privilege of using “datacenter grade” products on a per-hour basis.

What they don’t understand is that key innovation in IT over the last 50 years has always started with an amateur dabbling in a garage or dorm (Apple, Microsoft, Google, GNU, Linux), and whatever that amateur has access to will probably be the tech the resulting unicorn company uses in the future. It will be the same with ML. However, there are signs on the horizon that NVIDIA will soon not be the only game in town for a wannabe amateur ML researcher. We now have the M2 Ultra from Apple with unified memory (eye-wateringly expensive, but it shows what’s to come), and AMD allegedly has a very similar unified-RAM Zen 4 based product ready to go. I’m looking forward to these products.


The API used by apps on Windows to get memory temp is different from the one used by nvidia-smi, and NVIDIA has ported that library to Linux. If Linux’s “many” programmers aren’t willing to create a CLI or GUI app, that’s not really NVIDIA’s problem.

Please use

nvmlDeviceGetFieldValues

with a field ID of 82 (https://github.com/NVIDIA/nvidia-settings/blob/7471c5b584c4d8df8d81c336c01b29b8e4b15b1d/src/nvml.h#L1455)
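
A minimal C sketch of that call might look like this. It assumes GPU index 0 and uses the NVML_FI_DEV_MEMORY_TEMP constant (field ID 82 in the nvml.h linked above); error handling is kept short:

/* memtemp.c - read field 82 (NVML_FI_DEV_MEMORY_TEMP) via nvmlDeviceGetFieldValues */
/* build (paths may vary): gcc memtemp.c -o memtemp -I/usr/local/cuda/include -lnvidia-ml */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlFieldValue_t fv = {0};
    fv.fieldId = NVML_FI_DEV_MEMORY_TEMP;   /* field ID 82 */

    if (nvmlInit_v2() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) {
        fprintf(stderr, "could not get a handle for GPU 0\n");
        nvmlShutdown();
        return 1;
    }

    nvmlReturn_t rc = nvmlDeviceGetFieldValues(dev, 1, &fv);
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlDeviceGetFieldValues: %s\n", nvmlErrorString(rc));
    } else if (fv.nvmlReturn != NVML_SUCCESS) {
        fprintf(stderr, "field %u: %s\n", fv.fieldId, nvmlErrorString(fv.nvmlReturn));
    } else {
        /* which union member is valid depends on fv.valueType; memory temperature
           comes back as an unsigned int (degrees C) where the driver populates it */
        printf("memory temp: %u C (valueType=%d)\n", fv.value.uiVal, (int)fv.valueType);
    }

    nvmlShutdown();
    return 0;
}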


This was a nice lead, but on my system it consistently returns 0. I tried the Go bindings and couldn’t confirm that any fields were working, so I tried Python as well; that returned valid values for other fields (for example, accurately reflecting ECC status and incrementing a couple of other policy counters) but always gave zero for field 82.

Ubuntu 22.04
RTX 4090
Driver: 535.129.03

Any news?
This is a must, ASAP.
@nadeemm, any news?

Thanks. It works. I forked it and created a Prometheus exporter for it. GitHub - jjziets/gddr6_temps: Linux-based GDDR6/GDDR6X VRAM temperature reader for NVIDIA RTX 3000/4000 series GPUs.

But now it’s clear that it’s not VRAM or core temps that are causing the GPU failures. It’s the hot spot.

Is there a way to get the junction hot spot temperature under Linux?

There is.