If you can get past the SIGSEGV error I get, maybe it’ll work for you. My kernel has
so likely that’s why
it isn’t working on my system. Hopefully we get a solution soon, because I’ve already fried one 3090 MOSFET under warranty and want to fit a waterblock, but I have no way of checking temperatures in Linux and no access to Windows. If I weren’t doing AI/ML serving, I would have moved to another option already. Apparently the largest AI/ML user base is on Linux, the best cards are NVIDIA’s, and Jensen keeps saying he loves us, but … my RTX is getting fried and I can’t see the temps.
Nope. No reference link for the NVML API was provided, and googling the topic gave pretty terrible non-instructions and hairballs. The GitHub repo I pasted has easy-to-follow instructions, and just works, nice and clean.
More than happy to try it out for you. And maybe I’m just not that well versed, but what exactly do I do with that? How do I call that API? Is there a CLI command, or do I write something in C or a Python script? I’m lost… to poorly paraphrase Bones: I’m a physicist, not a CS guy.
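For what it’s worth, NVML is a C library (`libnvidia-ml.so.1`) that ships with the NVIDIA driver, so you can call it from a few lines of Python via `ctypes` without installing anything extra — no separate CLI needed. A minimal sketch (the NVML function names are the library’s own; the graceful fallback when no driver is present is my addition, and `read_gpu_temp` is just a name I picked):

```python
import ctypes

NVML_TEMPERATURE_GPU = 0  # NVML sensor index for the GPU core temperature


def read_gpu_temp(device_index=0):
    """Return the GPU core temperature in degrees C, or None if NVML is unavailable."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return None  # driver/library not installed on this machine
    if nvml.nvmlInit_v2() != 0:
        return None
    try:
        handle = ctypes.c_void_p()  # nvmlDevice_t is an opaque pointer
        if nvml.nvmlDeviceGetHandleByIndex_v2(device_index, ctypes.byref(handle)) != 0:
            return None
        temp = ctypes.c_uint()
        if nvml.nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU,
                                         ctypes.byref(temp)) != 0:
            return None
        return temp.value
    finally:
        nvml.nvmlShutdown()


if __name__ == "__main__":
    t = read_gpu_temp()
    print("GPU temp:", t if t is not None else "NVML not available")
```

Note this only reads the core temperature that nvidia-smi already shows; the VRAM temperature is the part that’s still contested in this thread.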
At least on my 4090, nvmlDeviceGetThermalSettings returns four sets of settings at sensor indexes 0, 1, 2, and 15, but all are identified as GPU for both the controller and the target, and all return the same temperature value. So the direct PCI access method in gddr6 remains the only mechanism that has demonstrably worked.
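Roughly the query I used, as a Python `ctypes` sketch. The struct layout is transcribed from `nvml.h` as I read it (3 sensors max per call, each sensor carrying controller, default min/max temps, current temp, and target; sensor index 15 meaning “all”) — double-check it against your driver’s header before relying on it:

```python
import ctypes

NVML_MAX_THERMAL_SENSORS_PER_GPU = 3
NVML_THERMAL_TARGET_ALL = 15  # sensorIndex value meaning "return every sensor"


class NvmlThermalSensor(ctypes.Structure):
    _fields_ = [("controller", ctypes.c_int),
                ("defaultMinTemp", ctypes.c_int),
                ("defaultMaxTemp", ctypes.c_int),
                ("currentTemp", ctypes.c_int),
                ("target", ctypes.c_int)]


class NvmlGpuThermalSettings(ctypes.Structure):
    _fields_ = [("count", ctypes.c_uint),
                ("sensor", NvmlThermalSensor * NVML_MAX_THERMAL_SENSORS_PER_GPU)]


def read_thermal_settings(sensor_index=NVML_THERMAL_TARGET_ALL, device_index=0):
    """Return (controller, target, currentTemp) tuples, or None if NVML is unavailable."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return None
    if nvml.nvmlInit_v2() != 0:
        return None
    try:
        handle = ctypes.c_void_p()
        if nvml.nvmlDeviceGetHandleByIndex_v2(device_index, ctypes.byref(handle)) != 0:
            return None
        settings = NvmlGpuThermalSettings()
        if nvml.nvmlDeviceGetThermalSettings(handle, sensor_index,
                                             ctypes.byref(settings)) != 0:
            return None
        return [(s.controller, s.target, s.currentTemp)
                for s in settings.sensor[:settings.count]]
    finally:
        nvml.nvmlShutdown()
```

On my card every tuple comes back with the same GPU controller/target pair and the same temperature, which is the behaviour described above.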
It is October 2023 and NVIDIA still isn’t providing a way to read VRAM temps on an RTX 3090 via nvidia-smi. (I haven’t actually tested the third-party open-source tool yet.)
Unless the situation has changed in the meantime and I’m not aware of it, this behaviour from NVIDIA makes it 100% clear what they are doing.
They are using the “so you want to mine crypto on consumer GPUs” argument to explain away the lack of this crucial feature, while we all know what the truth is.
NVIDIA doesn’t want a random person who wants to do ML research as a hobby to buy a couple of old RTX 3090s; they want that person to sign up for AWS, Azure, or the (not-at-all-)OpenAI and pay 10x more for the privilege of using “datacenter grade” products on a per-hour basis.
What they don’t understand is that the key innovation in IT over the last 50 years has always started with an amateur dabbling in a garage or dorm (Apple, Microsoft, Google, GNU, Linux), and whatever that amateur has access to will probably be the tech the resulting unicorn company uses in the future. It will be the same with ML. However, there are signs on the horizon that NVIDIA will soon not be the only game in town for a wannabe amateur ML researcher. We now have the M2 Ultra from Apple with unified memory (eye-wateringly expensive, but it shows what’s to come), and AMD allegedly has a very similar unified-RAM Zen 4 based product ready to go. I’m looking forward to these products.
The API used by apps on Windows to get the memory temperature is different from the one used by nvidia-smi, and NVIDIA has ported that library to Linux. If Linux’s “many” programmers aren’t willing to build a CLI or GUI app on top of it, that’s not really NVIDIA’s problem.
This was a nice lead, but on my system it consistently returns 0. I tried the Go bindings and couldn’t confirm that any fields were working, so I tried Python as well; that returned valid values for other fields (for example, accurately reflecting ECC status and incrementing a couple of other policy counters) but always gave zero for field 82.
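For context, 82 here is NVML’s `NVML_FI_DEV_MEMORY_TEMP` field ID, read through `nvmlDeviceGetFieldValues`. A `ctypes` sketch of the query I was making — the struct layout is transcribed from `nvml.h` as I read it (newer headers name the second member `scopeId`), so verify against your driver’s header:

```python
import ctypes

NVML_FI_DEV_MEMORY_TEMP = 82  # field ID for the memory (VRAM) temperature


class NvmlValue(ctypes.Union):
    _fields_ = [("dVal", ctypes.c_double),
                ("uiVal", ctypes.c_uint),
                ("ulVal", ctypes.c_ulong),
                ("ullVal", ctypes.c_ulonglong),
                ("sllVal", ctypes.c_longlong)]


class NvmlFieldValue(ctypes.Structure):
    _fields_ = [("fieldId", ctypes.c_uint),
                ("scopeId", ctypes.c_uint),
                ("timestamp", ctypes.c_longlong),
                ("latencyUsec", ctypes.c_longlong),
                ("valueType", ctypes.c_int),
                ("nvmlReturn", ctypes.c_int),
                ("value", NvmlValue)]


def read_memory_temp(device_index=0):
    """Return the reported VRAM temp (zero on my consumer card), or None without NVML."""
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return None
    if nvml.nvmlInit_v2() != 0:
        return None
    try:
        handle = ctypes.c_void_p()
        if nvml.nvmlDeviceGetHandleByIndex_v2(device_index, ctypes.byref(handle)) != 0:
            return None
        fv = NvmlFieldValue()
        fv.fieldId = NVML_FI_DEV_MEMORY_TEMP
        if nvml.nvmlDeviceGetFieldValues(handle, 1, ctypes.byref(fv)) != 0:
            return None
        return fv.value.uiVal  # this is the value that is always 0 for me
    finally:
        nvml.nvmlShutdown()
```

The per-field `nvmlReturn` member is worth checking too: a success code with a zero value would suggest the driver simply doesn’t expose the sensor on consumer cards, rather than the call failing.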