I run/manage a rendering farm with a bunch of RTX 30XX cards. They are slightly overclocked and run 24/7. We try to keep them as cool as possible, but from time to time we get crashes because some cards start to run hotter.
We had to run a 12-hour benchmark under Windows with HWiNFO64 to understand what was going on, and we saw the memory junction temperature reach around 100C (max: 108C) before the card crashed.
We understand that this is related to GDDR6X memory and how the chips work.
However, we are unable to monitor the memory junction temperature because we run our cluster under Linux.
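For context, here is a minimal sketch (assuming the pynvml bindings, `pip install nvidia-ml-py`) of what we can read today through NVML. As far as we can tell, the only public sensor type is NVML_TEMPERATURE_GPU (the core sensor), so there is no equivalent of the memory junction reading HWiNFO64 shows on Windows:

```python
# Minimal sketch, assuming the pynvml bindings are installed.
# NVML's public temperature API only takes NVML_TEMPERATURE_GPU;
# there is no sensor type for the GDDR6X memory junction.
from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTemperature,
    NVML_TEMPERATURE_GPU,
)

nvmlInit()
try:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        # Core GPU temperature in degrees Celsius -- the only sensor exposed.
        temp = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: {temp} C")
finally:
    nvmlShutdown()
```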
Our goal is to keep the investment in these cards, but without proper monitoring, the risk of losing cards to high temperatures makes it hard to justify not switching to AMD or older NVIDIA cards.
Note: we use Prometheus exporters/Grafana to monitor the host and each card. Unfortunately, due to the lack of support in Linux, the exporter is likewise unable to export the memory junction temperature.
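To make the gap concrete, here is a stripped-down exporter sketch along the lines of what we run (assuming prometheus_client and pynvml; the metric name and port are illustrative, not our production config). It can publish the core GPU temperature, but there is simply no NVML call with which to fill a memory-junction gauge next to it:

```python
# Stripped-down exporter sketch, assuming prometheus_client and pynvml.
# Metric name and port are illustrative, not our production setup.
import time
from prometheus_client import Gauge, start_http_server
from pynvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetTemperature,
    NVML_TEMPERATURE_GPU,
)

gpu_temp = Gauge("gpu_core_temp_celsius", "GPU core temperature", ["gpu"])
# A gpu_mem_junction_temp_celsius gauge cannot be populated: NVML exposes
# no memory junction sensor on Linux, which is exactly the gap above.

nvmlInit()
start_http_server(9400)  # illustrative port
while True:
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        gpu_temp.labels(gpu=str(i)).set(
            nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
        )
    time.sleep(10)
```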
That's great! Are you able to share any more information (when did you start addressing this / when was the ticket raised, any progress on the fix, any ETA for the fix)?
Please prioritise this. I've been stuck for a month, unable to use my expensive card because I can't check VRAM temperatures while training models and don't want to burn out my card. This should be of the highest priority, since we have no idea what our hardware is being exposed to during full loads.