This is really needed by the DL community on Linux, especially on these new RTX cards. Do any of you have an idea how we could implement this on Linux ourselves?
I currently have 4x RTX 3090 in my DL machine and I need to fix the issue with memory temps, because the machine crashes every time I run training. As a quick test I replaced the thermal pads on my GPUs, and they are working fine for a while now, but the machine still crashes after 6-8 hours of training, so this is a memory temperature issue for sure. With access to the temps from the terminal, I could write scripts that lower the TDP and apply other optimizations while I am training.
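To show what I mean, here is a rough sketch of the kind of script I would write. Since the memory temps are not exposed, it falls back on the GPU core temperature (`temperature.gpu` from `nvidia-smi`) as a stand-in, which is exactly the limitation I am complaining about. The thresholds and wattages are made-up placeholders, not tested values, and setting the power limit with `nvidia-smi -pl` needs root.

```python
import subprocess
import time

# Hypothetical values for illustration only; tune them for your own cards.
TEMP_LIMIT_C = 75    # core temperature at which we back off
HIGH_POWER_W = 350   # normal power limit
LOW_POWER_W = 280    # reduced power limit while the card is hot


def parse_temps(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def choose_power_limit(temp_c):
    """Return the power limit (watts) to apply for a given temperature."""
    return LOW_POWER_W if temp_c >= TEMP_LIMIT_C else HIGH_POWER_W


def read_temps():
    """Query the core temperature of every GPU through nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


def apply_power_limit(index, watts):
    """Set the power limit of one GPU (requires root)."""
    subprocess.run(["nvidia-smi", "-i", str(index), "-pl", str(watts)],
                   check=True)


def main(poll_seconds=10):
    """Poll temperatures forever and adjust power limits when they change."""
    current = {}
    while True:
        for index, temp_c in read_temps().items():
            wanted = choose_power_limit(temp_c)
            if current.get(index) != wanted:
                apply_power_limit(index, wanted)
                current[index] = wanted
        time.sleep(poll_seconds)
```

Calling `main()` as root in a second terminal while training runs would do the polling. If the driver exposed the memory temps, only the query field in `read_temps` would need to change.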
Please, Nvidia, we are spending a lot of $$ on these GPUs, and this is something that I believe can be implemented on your side.