Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

CMP products also lack mem_temp display. NVIDIA-Microsoft relations are a more realistic reason, IMHO…

Interesting. Anyway for the benefit of @nadeemm since there seems to be some confusion about what is being requested:
[Imgur image]
This is from Windows. We’d like to be able to read the Memory Junction temperature of GDDR6X in Linux as we can in Windows, not the memory case temperature.

We are all waiting for it, but they don’t provide the firmware update tool for Linux either:
https://nvidia.custhelp.com/app/answers/detail/a_id/5165/~/nvidia-resizable-bar-firmware-update-tool
It seems they just don’t care about Linux users.

Well, actually they do. nvflash. For some reason it’s not distributed by Nvidia themselves (neither is the Windows version), but it is an Nvidia product. It’s hosted by TechPowerUp here. It’s in the AUR even.

>_ nvflash:

NVIDIA Firmware Update Utility (Version 5.660.0)
Copyright (C) 1993-2020, NVIDIA Corporation. All rights reserved.


-- Primary Commands --
Update VBIOS firmware:           nvflash [options] <filename>
Save VBIOS firmware to file:     nvflash [options] --save <filename>
Display firmware bytes:          nvflash [options] --display [bytes]
Change the start address:        nvflash [options] --offset [start]
Display firmware bytes in ASCII: nvflash [options] --string
Check for supported EEPROM:      nvflash [options] --check
Display VBIOS version:           nvflash [options] --version [<filename>]
List adapters:                   nvflash [options] --list
Compare adapter firmware:        nvflash [options] --compare <filename>
Verify adapter firmware:         nvflash [options] --verify <filename>
Verify adapter IFR firmware:     nvflash [options] --verify --ifronly <filename>
Display GPU ECID/PDI:            nvflash [options] --ecid
Display License information:     nvflash [options] --licinfo <filename>
Generate a License Request File: nvflash [options] --licreq <filename>,<reqType>
Provide a HULK license file:     nvflash [options] --license <filename>
List out all the PCI devices:    nvflash [options] --lspci
Access PCI Configure register:   nvflash [options] --setpci
Display tool building information:nvflash [options] --buildinfo
Display GMAC MCU version:        nvflash [options] --querygmac
Update GMAC MCU firmware:        nvflash [options] --proggmac <filename>.rom
Save GMAC MCU firmware to file:  nvflash [options] --savegmac <filename>.rom
List GMAC MCUs:                  nvflash [options] --listgmac
Write protect EEPROM:            nvflash [options] --protecton
Remove write protect:            nvflash [options] --protectoff

Press 'Enter' to continue, or 'Q' to quit.

Note the banner: “NVIDIA Firmware Update Utility (Version 5.660.0) Copyright (C) 1993-2020, NVIDIA Corporation. All rights reserved.”

Now, I came to this thread because I am also furious about the lack of memory temperature readings, especially since I have a 3090 and that’s a serious issue on those cards. Still, they do actually have a native Linux firmware flashing utility, and it’s exactly what I used to flash the Resizable BAR VBIOS update from EVGA onto my 3090. The link you posted was for a ReBAR-specific firmware updating tool (no idea why they needed that), but the regular firmware flashing tool (which works for updating to Resizable BAR support) has a native Linux version.

I got my 3090 @ Micro Center on launch day so obviously it didn’t have ReBAR support, and so this tool was how I got it:

[Screenshot: Screenshot_20211231_124105]

As far as the original topic goes, I know there are sensors for the memory thermals, so there’s literally no way it’s not possible to expose that through the driver to the user.

Honestly NVAPI not existing for Linux is already ridiculously stupid (especially since NVAPI stuff like DLSS actually works in Wine/Proton). But what’s more stupid is that (almost) all of Nvidia’s hardware monitoring and control is tied to libXNVCtrl, the NV-CONTROL X extension, which means nvidia-settings and anything else that uses NV-CONTROL (like GWE and any community fan control or overclocking utility/application) don’t work in Wayland. This is outrageous, and they need to move to a sysfs-based approach like AMD, which has nothing to do with which display server you’re running (or if you’re running one at all).
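For anyone who hasn’t seen what the sysfs approach looks like in practice, here is a minimal Python sketch. It assumes the standard Linux hwmon layout (`tempN_input` in millidegrees Celsius, optional `tempN_label`) and that the amdgpu driver labels its sensors along the lines of `edge`, `junction` and `mem`; which labels actually show up depends on the card and kernel. The function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict, List


def read_hwmon_temps(hwmon_dir: Path) -> Dict[str, float]:
    """Return {label: degrees Celsius} for every tempN_input in one hwmon directory."""
    temps: Dict[str, float] = {}
    for input_file in sorted(hwmon_dir.glob("temp*_input")):
        stem = input_file.name[: -len("_input")]  # e.g. "temp1"
        label_file = input_file.with_name(stem + "_label")
        # Fall back to the file stem when the driver provides no label file.
        label = label_file.read_text().strip() if label_file.exists() else stem
        # hwmon reports temperatures in millidegrees Celsius.
        temps[label] = int(input_file.read_text().strip()) / 1000.0
    return temps


def find_amdgpu_hwmon(root: Path = Path("/sys/class/hwmon")) -> List[Path]:
    """List hwmon directories whose 'name' file reads 'amdgpu'."""
    found: List[Path] = []
    if not root.is_dir():
        return found
    for candidate in sorted(root.iterdir()):
        name_file = candidate / "name"
        if name_file.exists() and name_file.read_text().strip() == "amdgpu":
            found.append(candidate)
    return found


if __name__ == "__main__":
    # No X server, no Wayland compositor, no vendor library: just files.
    for hwmon in find_amdgpu_hwmon():
        print(hwmon, read_hwmon_temps(hwmon))
```

The point is that nothing here depends on a display server, so the same few lines work on a headless rig over SSH.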

Although I think most of the main points have been covered, I would like to add in my voice to request this to become a reality. I would really like to be able to have memory temperature information on Linux. It’s been almost a year since this (relatively) simple feature has been requested, and in my opinion this would create a huge QoL improvement for many users. Please add it in as soon as possible.

This thread is now officially comical. First @wpierce you say a developer is on it and it’s a priority, then 4 or 5 months later nadeemm comes in and says it won’t be provided, either mistakenly or deliberately conflating Tcase with Tjunction, and then spams the thread with “please open another thread” and “what’s your use case” requests. To what end??? Disperse the concentration of ire maybe?? omg so much laughing-too-hard my kidney hurts.

I’ve found my alternative so I shouldn’t really care much but I will give you some benefit of doubt and expend the effort here to let you know this bullshit costs you DOLLARS.

We run a decent (midsized) mining operation and have a fair number of 10xx and 20xx cards. We snapped up some 3060 and 3070 cards shortly after launch (march/april of '20?) and they ran fine. We waited on add’l power to the facility so we did not aggressively pursue new cards however we did clear out some old 10xx cards and I did gain a few 3080’s in the meantime (mix of FE, Strix, Aorus and FTW3 cards - about 15 in total). I noticed quickly the variability in clocking mems and the Internet was kind enough to provide answers - mem temp throttles. We op almost exclusively on HiveOS but ofc keep some Windows around, including a test bench or two that I also use for baselining and determining OCs on new model cards.

CONCLUSIVE - the laggards were hitting mem temps in the 110s to 121C range. I also registered, by clamp meter on the wall cord, an increase in amp draw (not sure how/if that relates, but it def means I need more headroom in my power supplies for these guzzlers).

Bottom line was 6 of 15 3080s couldn’t achieve expected hashrates. Repadded myself. 2 of 6 improved but were still getting uncomfortably close to 110 C mem temps in Windows on all but unacceptably conservative clocks (two others had to be pulled lower than stock clocks and power limited hard, incl. one 3080 FE, just to stop them from crashing). Repadded AGAIN using some unobtanium padding, then some ultra-fine-future-nanoparticle-promise-you-the-world stuff with an equally ultra price tag. Finally got some acceptable (sort of) results but not impressive. Not impressive because why do I gotta be my own home-grown mechanic on like $15k of new gear?? And who tf is gonna do this x however many we buy when we get our elec service upgraded?? Yeah, I’m looking forward to individually benching 100% of and then cracking open 30-40% of new GPUs before they even start working towards ROI. Because I can’t even reliably field-monitor them. Drink that green Kool-Aid, Oh Yeah!!

So, to make a long story a bit shorter: when our add’l 600A service came online we stocked up on some Team Red. Nicer distribution (our usual connect only did Green so we made a new connect). I guess because they are the “underdog” here they try harder, like Avis in the 90’s. And ofc a LOT less headaches with the hardware.

40x 6800XT, 18x 6900XT and a smattering of 6600s. This is first round. Around 68k iirc. That’s $68k of NOT nVidia cards and $68k closer that AMD is to eating your lunch.
I’m now looking forward to getting hands on some of the new Intel Arc cards. Hear they should be very power conscious and be very competitive on alt algos and won’t have to deal with any of the LHR bullshit on ETHASH either.

I’m hearing from my client that the next round (Feb-ish) should be about another $60k spend and they expect to have about $100k to spend by June/July if The Winter arrives.
Was a learning curve getting OCs right on the Red 6xxx series but it’s done. I’ll happily buy truckloads of more Red if you don’t
B) AT LEAST get some Tj reporting into the Linux kernel/driver/api/whatever the problem is
A) get your QC in line and fix the g-damned thermal padding issues.

Yes, your cards do better on the alt coins than the Reds, which is WHY we buy green in the first place.
But if your QC is asleep at the wheel and your driver development is lying drunk on the floor then fk it. I’d rather have stable and earning 24x7x365 Red cards than twitchy, finicky Green momma’s boys that need to be coddled all the time or just plain underperform. I’m not allowing my clients to pay a premium for crap that only performs as well as the Red equivalent because it has to be bottle fed AND can’t be monitored to know when the next tantrum is coming. I can pay less and be treated better for the same output.

And I’m not buying 3070’s either because how tf do I know if/when THEY start misbehaving? Oopsie. Guess you should have thought that this level of laziness (dgaf??) actually costs real sales.

So? How’s that, @nadeemm? Is close to $200k in sales enough “ammo” or should I open another thread? 🙄

Honestly, you all should be ashamed to share this drivel in public and even more ashamed to attach your real names to it. But it IS some of the funniest shit I’ve read on the internet in a while… 🤣🤣🤣🤣🤣😂😂🙃🙃🙃🙃🤣🤣🤣🤣🤣🤣🤣🤣

And if you DO get your collective heads out of your asses and fix it, I’ll consider recommending Green again. I consult and manage multiple farms. I am not tied to one client and I am not married to any “team”. I go where the hearth is warm, the wine flows and the steak is cooked pink.

Although maybe Intel will come eat everybody’s lunch. Now THAT would be fun to watch, lolol.

Good luck with this.

You’re not the target audience for these cards, and you’re making the planet a worse place for billions of other people.

What you are doing is immoral from every standpoint.

If nVidia or their distributorships were being honest with you they would differ greatly.

At the scale of this farm (mid-smallish), CMP cards are more difficult to attain than 3090’s to the average gamer or DL/ML researcher. You think we’re running around emptying Micro Center shelves or flogging Best Buy on launch day? Who do you think is providing these cards? Do you think nVidia has no power to stop it if they wanted to?

You’ve been played my friend but not by me or the mining industry.

Consider: If you are the squirrel in this world then we are just about the avg Koala or maybe Dingo.

Now ask who are the real 800lb gorillas fisting the money here?

ain’t me…

Quite judgmental for someone so massively ill informed.
Both about myself (whom you know nothing about) and the quite nascent mining industry (which you seem to know about as little of.)

It’s an overplayed narrative so you could be forgiven but for your lack of an open mind and rush to judge instead of dialogue.

Within the space there is tremendous reliance and also investment in renewables.
China “chasing away” its mining farms did the planet a relatively huge favor, as they were most of the dirtiest parts of the mining industry (very heavy reliance on coal-fired power; substantially less in the US, where a large section of those ops ended up, most of them seeking cheap renewable energy, for example in TX). Miners tend to invest in cheap energy (either by building or by investing in local utility/state solar farms and similar).

When we build we design towards largely overprovisioning solar capacity (panels are cheap relative to battery capacity) and ensure we run purely on solar during the day, overprovisioning so that we can also feed the grid (it helps them with daytime surge usage, when utils need help the most). Good designs imho generally provide more to the grid during the day than will be used from the grid overnight. Essentially carbon neutral at worst and carbon negative most of the time.

So how’s YOUR PERSONAL carbon footprint compare Mr. Concerned? Is YOUR life carbon negative?

Something about stones and glass houses…

Again, quite judgmental. Also improperly placed anger mostly from being ill informed.

I hope that self-righteous arrogance (and ignorance) of yours doesn’t work its way into any of the models you are teaching (though how could it not) and hopefully those models won’t affect anyone’s personal well-being. Certainly enough has been said about the prejudices present in ML and AI algorithms negatively affecting some people’s lives.
Perhaps I should judge YOU immoral for your involvement in the industry without knowing you any more than you know me. Turnabout after all IS fair play…

And I’ll also ignore the foolishness of you not realizing we are seeking the same outcomes here, so technically I’m your ally in this. I don’t spend enough to have nVidia deliver containers to me off the ship but I do spend more than you. Which might be enough to get the nvid devs in this thread taken seriously by their higher-ups and get us both the feature we want.

Or you could just maintain your holier-than-thou attitude and keep pounding sand till you turn blue in the face. Honey or vinegar. Choose wisely.

Whataboutism at its finest, and a false equivalency.

Oh - I must have missed that the long-game goal of cryptocurrency mining was sustainable investment in renewable energy sources and a reduction of global industrial environmental impact, rather than benefiting early adopters by legitimising a novel investment vehicle. You learn something new every day.

I came here, to a SOFTWARE DEVELOPERS FORUM, to add a voice for a feature I require. You decided to try to make some political statement out of it by attacking me directly. Good luck with your life. Not that you deserve any.

“Never wrestle with pigs. You both get dirty and the pig likes it.”

I agree, there would be fewer issues if “technical” people considered the ethical and societal impacts of their actions and work rather than maintaining false or fleeting comfort in their “ignorant bliss”.

Just popping in to say, as somebody who managed to purchase a single 2070S card a couple of years ago when GPU prices were “normal” (at least almost) for a while, mostly to run games and other 3D graphics apps, but also mine and sometimes run other CUDA stuff on the side, I’d appreciate it a lot if the Linux driver package some day achieved complete feature parity with the Windows equivalent (sans Direct3D obviously, and GFEx I never installed even on Windows).

It’s seriously annoying how it’s not even possible to properly undervolt the GPU on Linux, it’s sucking tens of watts more power while mining Ethereum than on Windows for no good reason. Actually, I’ve resorted to just running Windows (despite hating what it has become) on the machine for now ONLY because of this. It’s a real shame, games and the apps I use generally work great on Linux these days (and the CUDA stuff often targets Linux/Unix ONLY), well enough for me anyway.

Personally I couldn’t care less the driver and libraries are closed source as long as generally Everything Just Works and (in many cases better than on AMD too ;) at least as long as they don’t become actively user-hostile software like Windows is these days, and even on my non-Nvidia iGPU only laptop I’m running Xorg instead of Wayland ATM as I’m typing this (xfce/xfwm4 doesn’t support Wayland anyway), but seriously, please don’t skimp on features like these even though I realize they may not get used on professional CAD workstations, datacenters with lots of cards running GPU compute stuff, etc. which are probably scenarios where most of the Unix driver use happens instead of hobby desktop computing/gaming and very small scale mining.

Would hate to be forced to go look at AMD or Intel for my next card in a few years.

I have a 3090 ASUS OEM model. When the GPU temperature goes beyond 60 degrees, the hash rate drops from 117 to as low as 100. Is this due to the memory junction temperature increasing? Is thermal throttling happening at that point? Or does the hash rate dropping as the temperature rises mean the GPU is keeping itself on the safe side? I’m new to mining, please advise.

Because I’m on Linux, I’m not able to see the memory junction temp. Nvidia, save my GPU in the Linux environment. Provide a fix so I can see my memory junction temp.

There are a lot of Linux users and we need this tool; we do not know why Nvidia doesn’t have this tool for us…

This pisses me off so much. Nvidia’s lack of support, when it 100% seems possible too.

Apparently nvtool is already capable of reading memtemps for certain professional level cards. hiveos added support a few months back:

0.6-212@211124 (linux, 2021-11-24)

  • Added display temperature of memory for Nvidia GPUs equipped with HBM/HBM2 memory e.g. A100, CMP 170HX, etc
  • Updated nvtool to v1.57 (added memory temperature reporting using option --memtemp for GPUs with HBM/HBM2 memory; added option --throttle to show throttle reason which also reported by nvidia-info tool, so you can look all info using it)
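For comparison, stock `nvidia-smi` already lists a `temperature.memory` query field next to `temperature.gpu` (see `nvidia-smi --help-query-gpu`). The sketch below, in Python, shows how one might poll it and cope with a missing value. It assumes that cards without an exposed memory sensor report something non-numeric such as `N/A` in that column, which is what this thread is complaining about for GDDR6X; the helper names are hypothetical.

```python
import subprocess
from typing import List, Optional, Tuple

QUERY = "index,name,temperature.gpu,temperature.memory"
Row = Tuple[int, str, Optional[int], Optional[int]]


def parse_temp(field: str) -> Optional[int]:
    """Convert one CSV field to an int, or None when no number is reported."""
    try:
        return int(field.strip())
    except ValueError:
        # Assumption: unsupported sensors come back as text such as "N/A".
        return None


def parse_query_output(text: str) -> List[Row]:
    """Parse the output of --format=csv,noheader,nounits for QUERY."""
    rows: List[Row] = []
    for line in text.strip().splitlines():
        # Split the index off the left and the two temps off the right,
        # so a comma inside the GPU name would not break parsing.
        index, rest = line.split(",", 1)
        name, gpu_temp, mem_temp = rest.rsplit(",", 2)
        rows.append((int(index), name.strip(), parse_temp(gpu_temp), parse_temp(mem_temp)))
    return rows


def query_temps() -> List[Row]:
    """Run nvidia-smi once and return one row per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_query_output(result.stdout)


if __name__ == "__main__":
    try:
        for index, name, gpu_temp, mem_temp in query_temps():
            mem = "not reported" if mem_temp is None else f"{mem_temp} C"
            print(f"GPU {index} ({name}): core {gpu_temp} C, memory {mem}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi unavailable or failed: {exc}")
```

A monitoring script built this way would start showing memory temperatures on RTX 3000 cards without changes if the driver ever fills that field in.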

Please Nvidia, add display temperature of memory for RTX 3000 series in linux.

We don’t want them burnt.

If it wasn’t deep learning I wouldn’t have bought nVidia. nVidia are evil as I’ve started to realise within just 3 days of buying 3090. RAM is melting.

We have already been waiting 3+ months for NVidia’s reply!
Do they even care?
#givememtemponlinux

Really need to know the mem temps while running linux. Doesn’t seem like that big of a request… Might have to think about switching to AMD

Almost a full year since the issue was raised and so far no visible action taken to address this fault… kind of a shame

I order 100 cards a month. That’s $115,000 A MONTH I’m not buying from Team Green. The only reason: Nvidia just got lazy. It’s at the top, so they said, why try. Gamers will never push their cards “that far” so crap thermal pads are ok. But it’s not ok for me. It’s lazy. It’s cheap. It’s what Nvidia is becoming. The 3080 and up prove the point. Their mentality of “just let them run hot” is unbelievable. Who makes such a great product, puts such high end GDDR6X on it, then screws it up with a cheap thermal pad? Like… really? Fine, whatever. I can change pads. Put some decent paste on them. No worries.

But no memory temps in Linux? This is just downright laziness. There isn’t any excuse. Temps are the FIRST thing you’d put in. I mean it has to be there: if it thermal throttles it has to know the temps. You’re telling me it takes months on end, to the 12th of never, to expose them on the API for Linux? What a crying shame Nvidia has become.
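To back up the “it has to know” point: NVML does report why clocks are being held down, as a bitmask from `nvmlDeviceGetCurrentClocksThrottleReasons`. Below is a small Python sketch that decodes such a mask. The bit values are copied from `nvml.h` as I remember them, so treat them as an assumption and check your own header; the short names and the choice of which bits count as “thermal” are mine.

```python
from typing import List

# Bit values as recalled from nvml.h (nvmlClocksThrottleReason*); verify locally.
THROTTLE_REASONS = {
    0x0001: "gpu_idle",
    0x0002: "applications_clocks_setting",
    0x0004: "sw_power_cap",
    0x0008: "hw_slowdown",
    0x0010: "sync_boost",
    0x0020: "sw_thermal_slowdown",
    0x0040: "hw_thermal_slowdown",
    0x0080: "hw_power_brake_slowdown",
    0x0100: "display_clock_setting",
}

# Only the two explicitly thermal bits; hw_slowdown can also be heat-related.
THERMAL_MASK = 0x0020 | 0x0040


def decode_throttle_reasons(mask: int) -> List[str]:
    """Return the names of all reason bits set in the mask, lowest bit first."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def is_thermal_throttle(mask: int) -> bool:
    """True when a software or hardware thermal slowdown bit is set."""
    return bool(mask & THERMAL_MASK)


if __name__ == "__main__":
    try:
        import pynvml  # from the nvidia-ml-py package, if installed

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        pynvml.nvmlShutdown()
    except Exception as exc:  # no pynvml, no driver, or no GPU
        print(f"NVML not usable here: {exc}")
    else:
        print(decode_throttle_reasons(mask), "thermal:", is_thermal_throttle(mask))
```

So on Linux you can see that the card slowed down for thermal reasons, just not the memory junction number that would tell you how close it was getting beforehand.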

And it’s us now. Miners pushing the cards, and gamers who want to run their cards at levels THEY PAID FOR. And they cheap out on us in the Linux driver. Why? Simply no excuse. But in other areas, AI, deep compute and the like, they are cheaping out as well. Compute in several cases has a memory load, I know, I run them, never mind render farms, and guess what, they need to know their memory temps as well. But those guys are just not as vocal as miners, who are most likely “pushing them the hardest”. But my render farm needs this and it would save us so much trouble.

I’m sorry, but Nvidia should have been “Oh you guys need memory temps in Linux, give us a week”. I mean come on, it’s Nvidia, you guys release WHOLE drivers at the drop of some B-list game. I get it, huge market. But those guys buying GPUs 100k+ at a time, we get the middle finger? I don’t know if Nvidia is just so big it’d rather this money flow go to team red. You know, after you piss off enough people, that pile of cash gets pretty big going the other way.

I would be buying team green, but I will not place another 100 Nvidia GPU order with my distributor until Nvidia fixes this. And if they don’t… no worries, AMD is taking care of what we need, and the funny thing is, we’re getting about the same hash rate, perhaps 4-5% slower but 8-12% cheaper. Sooo, no sweat off my back. High five @user113775
