Question about RTX 6000 Ada slowing under load

I recently began adding RTX 6000 Ada cards to my workstation and noticed that the performance on my ML workloads wasn't a significant improvement over the A6000.

Digging in a bit, it seems like performance drops significantly once the card starts to warm up. Looking at nvidia-smi on Linux, I noticed some weird things:

 Temperature
    GPU Current Temp                  : 46 C
    GPU T.Limit Temp                  : 36 C
    GPU Shutdown T.Limit Temp         : -7 C
    GPU Slowdown T.Limit Temp         : -2 C
    GPU Max Operating T.Limit Temp    : 0 C
    GPU Target Temperature            : 85 C

The numbers (particularly GPU T.Limit and GPU Slowdown T.Limit) seem weird; I'm not sure why the shutdown limit would be -7 C and the slowdown limit -2 C, but that seems really strange.
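
For reference, the same thresholds can be read straight from NVML, which nvidia-smi sits on top of. A minimal pynvml sketch along these lines (assuming the nvidia-ml-py bindings are installed and the card is GPU index 0) can cross-check what the driver actually reports:

    # Cross-check the thermal thresholds that nvidia-smi reports by querying NVML directly.
    # Sketch only: assumes the nvidia-ml-py (pynvml) bindings are installed and GPU index 0.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    print("Current temp:", pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU), "C")

    thresholds = {
        "Shutdown": pynvml.NVML_TEMPERATURE_THRESHOLD_SHUTDOWN,
        "Slowdown": pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN,
        "GPU max":  pynvml.NVML_TEMPERATURE_THRESHOLD_GPU_MAX,
    }
    for name, kind in thresholds.items():
        try:
            print(f"{name}:", pynvml.nvmlDeviceGetTemperatureThreshold(handle, kind), "C")
        except pynvml.NVMLError as err:
            print(f"{name}: not supported ({err})")

    pynvml.nvmlShutdown()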

These readings are the same on both of my 6000 Ada cards, so it doesn't seem to be a fluke.

This vs. my A6000:

Temperature
    GPU Current Temp                  : 37 C
    GPU T.Limit Temp                  : N/A
    GPU Shutdown Temp                 : 98 C
    GPU Slowdown Temp                 : 95 C
    GPU Max Operating Temp            : 93 C
    GPU Target Temperature            : 84 C

Not sure why there is no T.Limit, but at least the other numbers make sense.

Not sure if this is the issue, but it's worrying that the card seemingly starts to throttle at normal temperatures. Is this something that can be adjusted, or is there a reason the numbers would be so strange?

Some other details from nvidia-smi -q

Driver Version : 535.129.03
CUDA Version : 12.2

Attached GPUs : 7
GPU 00000000:42:00.0
Product Name : NVIDIA RTX 6000 Ada Generation
Product Brand : NVIDIA RTX
Product Architecture : Ada Lovelace
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
VBIOS Version : 95.02.3A.00.01

I have had this issue since I purchased the card. Mine is about half the speed of my previous-gen RTX A6000.

Everything appears fine until the GPU is under load; then it immediately downclocks to 500 MHz. It also only draws 35-36% of TDP, regardless of workload.

See the GPU-Z sensor readings attached. In them, the GPU is idle, then under load, then idle again.
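
If anyone wants the same kind of trace without GPU-Z, a rough pynvml loop along these lines (assuming GPU index 0) can log SM clock, power draw, and P-state once a second while the load runs:

    # Log SM clock, power draw (as % of the enforced limit), and P-state once a second.
    # Rough sketch with the pynvml bindings; assumes GPU index 0. Stop with Ctrl+C.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # NVML reports milliwatts

    try:
        while True:
            sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
            draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            pstate = pynvml.nvmlDeviceGetPerformanceState(handle)
            print(f"SM {sm_mhz:4d} MHz  {draw_w:6.1f} W ({100 * draw_w / limit_w:5.1f}% of limit)  P{pstate}")
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        pynvml.nvmlShutdown()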

@bigharryox, what PSU / mobo / CPU / etc. do you have?

Mine: Super Flower Leadex Platinum 1600W / ASUS WRX80E Sage / Threadripper Pro 3x75…

FWIW, if you’re on Windows, you can try my application to help find the issue. It is based on NVML and has an NVAPI extension that adds more information reporting and management abilities. I would look at temps, power draw, performance state, and performance limiters when this happens. The NVAPI extension's ability to disable dynamic performance states might also be helpful.

It has only been tested on GeForce cards so don’t be surprised if it doesn’t work.
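
If you are on Linux instead, the same counters are also exposed through NVML and can be scripted directly; a pynvml sketch along these lines (GPU index 0 assumed) should dump the active performance limiters along with temperature, power, and P-state:

    # Dump the currently active performance limiters plus temperature, power, and P-state.
    # pynvml sketch; assumes GPU index 0 and a driver new enough to report throttle reasons.
    import pynvml

    REASONS = {
        "SW power cap":        pynvml.nvmlClocksThrottleReasonSwPowerCap,
        "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
        "HW slowdown":         pynvml.nvmlClocksThrottleReasonHwSlowdown,
        "HW thermal slowdown": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
        "HW power brake":      pynvml.nvmlClocksThrottleReasonHwPowerBrakeSlowdown,
        "GPU idle":            pynvml.nvmlClocksThrottleReasonGpuIdle,
    }

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    pstate = pynvml.nvmlDeviceGetPerformanceState(handle)
    mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)

    active = [name for name, bit in REASONS.items() if mask & bit]
    print(f"{temp} C, {watts:.1f} W, P{pstate}, limiters: {active or 'none'}")

    pynvml.nvmlShutdown()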

Very useful app, thanks.

Things I learned:

  • The card always shows Hardware Slowdown Performance Limit = Active, whether idle or at full load.
  • At load, Software Thermal Slowdown Performance Limit = Active as well, even though all temps appear fine.
  • The card never enters the P0 state. At load it seems to prefer P2, and at idle it bounces between P2 and P3.

I also tried this card in another PC and it worked fine, so something is weird with my workstation. I think a failing PSU would be more obvious… the system would be unstable, especially at load, but it is perfectly stable.

Could a Windows or BIOS setting cause this limitation? ASUS WRX80 board.

Edit:

I tested my 6000 Ada again in the PC where it is slow. This time I gave it its own PSU! No change.

I am now thinking motherboard config / bug / failure…

Some people have resolved issues by updating the motherboard BIOS. YMMV. Don’t know what else to suggest, sorry.

I have an ASUS WRX80 Sage / Threadripper Pro 5965WX / dual-PSU config.

FWIW, the same offer as above applies here: if you’re on Windows, my NVML-based application (with the NVAPI extension) can show temps, power draw, performance state, and performance limiters when this happens.

OK, this is primarily a Linux box, but I can boot into Windows to give it a try.
In my case, I am not seeing significant downclocking immediately, but it seems to occur over time as the card starts to heat-soak.

Two people with the exact same motherboard / GPU config strongly point to a motherboard issue. I would try updating the BIOS. If that doesn’t work, contact ASUS.

Hello @bigharryox… I’m glad I found this post, as it has given me a little insight into my own issue here.

I have a similar setup to yours, and nvidia-smi gives the same temperature info, which I also found strange. After a while, the default nvidia-smi output shows an ERR flag in the wattage field, which led me to believe that either the GPU, the motherboard, or the PSU was dying.

I had just finished a RAM upgrade, and the situation was a massive headache because I could not get the workstation to POST after the install. My RAM vendor recommended that I update the BIOS, which I did, and that allowed the workstation to boot.

After this, the issue of the RTX 6000 Ada capping at half of its power popped up. I am running some serious AI pipelines as well as heavy Unreal dev, and the AI work was what revealed it. Once I got back to work, the processes I run started shifting their load over to the CPU on the more intense actions. Also, the RAM was running hot, over 80 C, with one stick topping out at 90 C, which is not good for sustained loads.

I was thinking, oh no, I have to buy a new GPU, which is very expensive for us here, so it had me sweating a bit. After doing some deep diagnostics and poking around, I decided to also update the IPMI firmware. Upon doing this I got a message that the PSU was failing. I know the BMC can sometimes give false positives, so I’m not sure this is exactly the reason, but it would make sense if it is true. Thankfully I have the exact same PSU in an unused workstation right now, so I am in the process of swapping it in to see whether that is the issue.

Were you able to resolve your issue? If so what was the cause?

Swapped out the PSU and got the same result: the GPU’s AI load shifts to the CPU, and Unreal framerates are cut in half. So we are down to the motherboard and the GPU. If anyone has any ideas as to why this might be happening, it would be a huge help in a time of need.

Haven’t been able to figure out what’s going on in my setup unfortunately.

I did some more tests, and in my case it does seem to be temperature related; unfortunately I don’t know how to differentiate between RAM temperature and general temperature sensitivity. Since then I have converted to a water-cooled setup, which seems to have reduced the effect fairly noticeably, but when running everything at maximum as the temperature climbs, the 6000 Ada seems to be the most affected by temperature increases (I have A6000s in the same system that are far more performance-stable than the 6000 Ada).

That said, I still don’t know why that’s the case, because everything is now running well below the thermal limits and I can still see it throttling down pretty quickly as the temperature climbs. I don’t think it’s a PSU issue in my case, but I am not sure how to evaluate that.
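
One way to put numbers on the temperature sensitivity would be to log GPU temperature against SM clock and the reported limiters over a run; a small pynvml sketch along these lines (GPU index 0 assumed, and the CSV filename is just an example) would show roughly at what temperature the clock starts to drop:

    # Correlate GPU temperature with SM clock and the throttle-reason bitmask over time.
    # pynvml sketch; assumes GPU index 0. "throttle_log.csv" is just an example filename.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    with open("throttle_log.csv", "w") as log:
        log.write("elapsed_s,temp_c,sm_mhz,power_w,throttle_mask\n")
        start = time.time()
        try:
            while True:
                temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                sm = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
                watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
                mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
                log.write(f"{time.time() - start:.0f},{temp},{sm},{watts:.1f},{mask:#x}\n")
                log.flush()
                time.sleep(1)
        except KeyboardInterrupt:
            pass

    pynvml.nvmlShutdown()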

@alstonn how did you get a message about the PSU?

Hey @bigharryox, thanks for the reply. Some good news, I think: the 6000 seems to be pushing near full power again on AI processes. It wasn’t the PSU, so the BMC was giving false info. I swapped PCIe slots and things seem to be working better again, but still not as fast as before the RAM upgrade. I’m thinking it’s the motherboard now. I’m going to order a new board, the newer WRX80 Sage II, and test that to see what comes of it. I’ll let you know how it goes.

The PSU info sensor is via the BMC/IPMI Admin UI.