Request: GPU Memory Junction Temperature via nvidia-smi or NVML API

Request
It would be extremely beneficial to be able to monitor the GPU memory junction temperature on Linux through nvidia-smi or the NVML API.

Background
With the latest RTX 3080/3090 series cards using GDDR6X, there are growing concerns about the temperature of the memory junction. Performance has generally been observed to throttle at around 110°C. Since cards do hit this throttling point, it would be great if Nvidia could expose this temperature through nvidia-smi or the NVML API. As far as I know, the capability to read these sensors exists in the latest Windows drivers and is exposed through NVAPI. Presumably it’s in the Linux drivers too, but if not, can we get that functionality added, along with query access through the NVML API or the nvidia-smi tool?
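For reference, here is a minimal sketch of what can be queried today, assuming nvidia-smi is on the PATH. The `temperature.gpu` and `temperature.memory` fields are existing `--query-gpu` properties, but as far as I can tell the memory field is only populated on HBM-based cards and comes back as N/A on GDDR6X GeForce cards. The function names `parse_temps` and `query_temps` are my own, not part of any Nvidia tool.

```python
import subprocess

# Fields requested from nvidia-smi; temperature.memory is expected to be
# unsupported (N/A) on RTX 3080/3090, which is the gap this request is about.
QUERY = "index,temperature.gpu,temperature.memory"


def parse_temps(csv_text):
    """Parse nvidia-smi CSV rows into dicts; unsupported fields become None."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, gpu, mem = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "gpu_c": int(gpu) if gpu.isdigit() else None,
            "memory_c": int(mem) if mem.isdigit() else None,
        })
    return rows


def query_temps():
    """Run nvidia-smi and return one dict per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temps(out)
```

Calling `query_temps()` on a machine with the driver installed should return one entry per card, with `memory_c` set to `None` wherever the driver does not report a memory temperature.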

I know there are a bunch of people who want this functionality, so if you’re reading this and are one of them, please post your thoughts or confirm your agreement to convince Nvidia this is a priority.

11 Likes

Agreed! The GPU temperature usually stays further below its throttling temp than the GDDR6X does. On Linux this concerns me when putting the card under heavy load, as I can’t check the memory junction temperature there to know if I’m in a safe range.

2 Likes

I have bought 6 NVIDIA RTX 3090s, investing more than €10,000 to use them under Linux, and NVIDIA makes me feel abandoned. If this situation persists, I will replace them and advise against the use of this brand, much to my regret.

1 Like

Agree. We should have this feature

2 Likes

Yes, I have the same problem: with an RTX 3080 I have seen memory junction temperatures of 95°C while the GPU stays under 55°C. I have already experienced GPU loss and reboots. On Linux I’m blind and can’t monitor anything beyond the GPU temperature, which is useless in the end. I didn’t have this issue with the RTX 3070; it ran cooler in general.

2 Likes

Totally agreed. Running my GPUs at full load all day rendering without knowing the temperature of the hot GDDR6X makes me feel like I’m roasting my money.
While Windows users have HWiNFO, Linux users just don’t have any way to get this temp.

1 Like

There are only 6 comments on this, but thousands of people are affected. I do not get it.
Does anyone know how HWiNFO64 reads it? If we show there is a need, they may agree to point us to where to start. I’d rather spend time trying to develop something open source and publicly available than wait for my cards to be roasted. I must pay off my investment in RTX toasters.

Thanks for the support webtech.

HWiNFO64 is Windows only, and it accesses the memory junction temps through NVAPI (a Windows-only interface that the community has been begging to get on Linux for ages). Frustratingly, the devs haven’t yet incorporated that functionality into the NVML API (which is available cross-platform). I don’t think there’s anything theoretically preventing them from offering this via the NVML API, unless the Linux drivers lack the ability to capture these temps the way the Windows ones do. Even then, I don’t see why that functionality couldn’t be added to the Linux drivers too.
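To illustrate the gap, here is a rough sketch of reading temperatures through NVML from Python, assuming the `pynvml` bindings are installed. To my knowledge the only sensor NVML defines for `nvmlDeviceGetTemperature` is `NVML_TEMPERATURE_GPU`, i.e. the core temperature that nvidia-smi already shows. The helper name `core_temperatures` is mine, and the module is passed in as an argument only so the sketch can be tried without a GPU.

```python
def core_temperatures(nvml):
    """Return the core temperature in °C of every GPU via an NVML binding.

    `nvml` is expected to be the `pynvml` module (or any object exposing
    the same functions).
    """
    nvml.nvmlInit()
    try:
        temps = []
        for i in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(i)
            # NVML_TEMPERATURE_GPU is the on-die core sensor, not the
            # memory junction; no memory junction sensor is offered here.
            temps.append(
                nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            )
        return temps
    finally:
        # Always release NVML, even if a query above raised.
        nvml.nvmlShutdown()
```

On a real machine this would be called as `import pynvml; print(core_temperatures(pynvml))`. A memory junction reading would need a new sensor type or a new call on Nvidia’s side.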

1 Like

You’re welcome. I will do what I can and try to share it. Since I train large models, I need the cards to run for several days, so I am doing tests. So far I have stabilized the RTX 3090 at under 300 W, with the GPU at 48°C and the memory at 85°C (versus 115°C at stock), holding indefinitely. I have been measuring for days in Windows to get a correlation between the different temperatures and then verify them in Linux. I have not tampered with the graphics card and I have not lost the warranty; I have only changed its mounting and the case. If anyone is interested I can share the modifications; I do not want to fill this thread with unrelated text.
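The correlation idea could be sketched roughly like this: log pairs of (GPU core temperature, memory junction temperature) on Windows, fit a straight line, then use that line on Linux where only the core temperature is visible. This is only an illustration with hypothetical function names, and it assumes the relationship is close to linear for a fixed workload, cooling setup, and ambient temperature, which may well not hold in general.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_memory_temp(gpu_temp_c, slope, intercept):
    """Estimate the memory junction temperature from the core temperature.

    Only as good as the logged data behind slope/intercept; treat the
    result as a rough indication, not a measurement.
    """
    return slope * gpu_temp_c + intercept
```

The inputs to `fit_line` would be your own logged readings (for example, exported from a HWiNFO log), and the fit should be redone whenever the cooling, power limit, or workload changes.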

+1 on a Linux API to retrieve this. Since the temps in question tend to get pretty high on 3080/3090 cards, a way to monitor these is much needed.

1 Like

This is really needed by the DL community on Linux, especially on these new RTX cards. Do any of you guys have an idea how we can implement this on Linux ourselves?

I currently have 4x RTX 3090 in my DL machine and I need to fix the issue with memory temps, because the machine crashes every time I train. I did a quick test and replaced the thermal pads on my GPUs, and they work better now, but the machine still crashes after 6-8 hours of training, so this is an issue with the memory temps for sure. With access to the temps in a terminal, I could write scripts that lower the TDP and apply other optimizations while I am training.
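As a rough sketch of what such a script might look like: step the power limit down when a temperature reading is above a target and back up when there is headroom. Since the memory junction temperature is not readable on Linux today, the input would have to be the core temperature (or an estimate derived from it). The thresholds, step size, and wattage bounds below are placeholder assumptions, not recommendations, and the function names are my own. `nvidia-smi -pl` is the existing power limit option and normally needs root.

```python
import subprocess


def next_power_limit(current_w, temp_c, target_c=70, step_w=10,
                     min_w=250, max_w=350):
    """Step the power limit down when too hot, back up when there is headroom.

    All defaults are illustrative placeholders; pick values for your card.
    """
    if temp_c > target_c:
        return max(min_w, current_w - step_w)
    if temp_c < target_c - 5:  # 5 °C of hysteresis to avoid oscillating
        return min(max_w, current_w + step_w)
    return current_w


def apply_power_limit(gpu_index, watts):
    """Set the power limit with nvidia-smi (needs root and a supported card)."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)], check=True
    )
```

A training wrapper could call `next_power_limit` every minute or so with the latest reading and only invoke `apply_power_limit` when the returned value differs from the current one.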

Please, Nvidia, we are spending a lot of $$ on these GPUs, and this is something that I believe can be implemented from your side.

1 Like

Do any of you guys have an idea how we can implement this on Linux ourselves?

AFAIK it is not possible without help from the drivers. Community-made drivers are not really feasible, so our only hope comes from Nvidia. Do not hope too much, though. The Linux community has been begging Nvidia for better driver support for ages.

Btw, I’d advise you to add some heatsinks on the backplate of the 3090. It helps a lot to keep the memory cool.

1 Like

Malka’s right. Devs haven’t been given the ability to implement this ourselves because Nvidia aren’t providing access to the required data through the drivers and API.

There are various mods that you can choose to perform: adding thermal padding on the hotspots is very effective, and adding heatsinks works well too. Just be very careful and aware of your card’s warranty process (often these mods can void the warranty).

Ideally though, there should be no need for mods. More importantly, even if one decides to mod, it would still be incredibly valuable to monitor how effective said mods are.

Considering Nvidia’s stance on the 3000 series being used for professional workloads, I highly doubt this will ever be introduced for Linux.

Consider how they forced the AIBs to pull the blower 3090s over reports of builders opting to put 3090s in servers instead of their professional cards. On top of the driver-level nerfs to deep learning on 3090s, the situation looks very grim.

They know that their user base on Windows is geared towards gaming, while Linux users are 100% running professional workloads, so the market already provides a clear segmentation they can further capitalise on.

In principle the solution should be simple if you are not using a Founders Edition:
a) Water cooling: good luck if you buy one, accepting the extra cost and the loss of warranty with most brands.
b) LinkUP riser cables, moving the graphics card as far away from the motherboard as possible; in my case 35 cm, running at Gen 4.

This will only solve half the problem (it will cool half of the memory modules, the ones on the heatsink face).

For the other half of the problem (the main problem) you will need:

  • Fans at maximum speed, if you are staying with air cooling.
    1- Look at the back of your RTX card and you will find the 4 screws that fix the heatsink to the GPU.
    2- Right around them sit half of the GDDR6X memory chips; they are the target to cool.
    3- Buy a DELTA brand fan; in my case I am using the AUB0812VH-5E58 with good results. Place these fans about 5 mm away from the backplate.
    4- If you add some Raspberry Pi heatsinks in the area where the backplate sits over the memory, you will get working temperatures below 85°C and your problems will be over.

With this you will preserve your investment, although the main problem will persist: you will still be going in blind, unable to monitor the status of your RTX cards. For this reason, you will have to sacrifice some of your time monitoring temperatures from Windows for a few hours (48 h in my case, but only because I am very picky).

If you have a Founders Edition, I do not know of a solution, but it is screwed.

I have spoken with an eBay shop about bringing in a good quantity of these fans. They will arrive in a couple of weeks, and I can add a post to my blog explaining everything step by step.

This is all really embarrassing, we are talking about products with 4-digit prices.

@webtech @Malka Thank you for the replies, guys. I ended up watercooling the whole machine and adding Noctua 3000 RPM fans to the case. So far so good, although the computer is very loud when training, which is not a big deal since I keep it in another room of the house. The memory on the back of the GPU is still a pain to cool down. So far I have used a backplate from EK with thermal pads, and with the powerful case fans it is working fine, with no more crashes. But I spent more money than I anticipated, and I hope these GPUs do not die in the next month, because I have them running 24/7 training DL models. For this kind of product and price tag, I think Nvidia should properly test the configuration and offer a proper driver for Linux users who use these GPUs for DL, which they know uses a lot of memory in some cases.

I agree this is an important request: it would provide feature parity with something already implemented in the Windows driver. Given the gap between the reported GPU temperature and the memory junction temperature, this is key to avoiding performance throttling due to overheating.

How do you get it on Windows?

Hardware Info: HWINFO

1 Like

It would be very advantageous to have this added to the Linux driver. My current employer uses Linux and we have various cards running renders daily. Being able to read the temperatures would allow us to push the cards harder if we are underutilizing them. I hope this is something they are looking into adding now that GDDR6X runs so much hotter.