Blackwell Pro 6000: clock speed (MHz) degrading

After 6 months of 80%+ utilization, these workstation cards are beginning to fail. Specifically, the max clock at 600W is slowly decreasing (when not under load, nvidia-smi reports 2500MHz+). I’ve sent 2 back to the vendor for investigation, but 2 more are starting to exhibit the same issue. It shows up noticeably in training graphs because all the “healthy” GPUs are waiting for the failing GPU to sync gradients.

Anyone else seeing this? Temps seem fine, no errors on the bus, just the clock slowly degrading to 500MHz worst case under load (600W).
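For anyone who wants to compare against their own cards, here is a minimal sketch of how per-GPU clock, power, and temperature could be sampled under load. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the function names and dictionary keys are my own, not from any library.

```python
import subprocess

# Query fields passed to nvidia-smi --query-gpu.
FIELDS = ["index", "clocks.sm", "power.draw", "temperature.gpu"]


def _to_float(value):
    """Convert a CSV cell to float; cells such as '[N/A]' become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_samples(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        samples.append({
            "index": int(parts[0]),
            "sm_mhz": _to_float(parts[1]),
            "power_w": _to_float(parts[2]),
            "temp_c": _to_float(parts[3]),
        })
    return samples


def query_gpus():
    """Run nvidia-smi once and return the parsed samples."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_samples(out)


if __name__ == "__main__":
    for sample in query_gpus():
        print(sample)
```

Running this in a loop during training gives a clock-vs-power trace per card. `nvidia-smi -q -d PERFORMANCE` should also list which throttle reasons, if any, the driver considers active, which may help separate power or thermal capping from something else.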

Further debugging: I’ve moved these over to another server and they still throttle, which pretty much rules out PSU/motherboard issues. This is very similar to the issue reported here:

I’m currently at 800MHz and 1400MHz, with a distinct downward trend (I log the power curves through wandb). The other healthy GPUs average about 2500MHz.
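A rough sketch of how the straggler check and the downward trend could be quantified from logged clocks. The 0.8 threshold, the GPU ids, and the function names are assumptions for illustration, and the numbers in the usage lines are synthetic, not my actual logs.

```python
from statistics import median


def flag_stragglers(clocks_mhz, ratio=0.8):
    """Return ids of GPUs whose clock is below ratio * the fleet median."""
    reference = median(clocks_mhz.values())
    return sorted(gpu for gpu, mhz in clocks_mhz.items()
                  if mhz < ratio * reference)


def slope_mhz_per_step(series):
    """Least-squares slope of a clock series, in MHz per logging step."""
    n = len(series)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    # Synthetic snapshot: six healthy cards near 2500MHz, two degraded.
    snapshot = {0: 2500, 1: 2510, 2: 800, 3: 1400,
                4: 2490, 5: 2505, 6: 2500, 7: 2495}
    print(flag_stragglers(snapshot))
    # Synthetic history for one card; a negative slope means it is degrading.
    print(slope_mhz_per_step([1000, 950, 900, 850]))
```

Comparing against the median rather than the mean keeps one or two bad cards from dragging the reference down, and the slope gives a single number to track per card over weeks.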