Quadro T2000 throttles down to 300MHz and stays there

Hello,

My configuration is a Dell Precision 5540 with an i7-9850H and a Quadro T2000 (specifically NVIDIA Corporation TU117GLM [Quadro T2000 Mobile / Max-Q]), running Fedora 31 (kernel 5.3.9-300.fc31.x86_64) with NVIDIA driver 440.31 installed from the akmod package.

The problem I’m encountering is that basically on all 3D applications, after a while, like 5 minutes or so, the GPU starts to throttle. That is normal and understandable especially on a laptop. But the problem is that it throttles down to 300MHz and does not clock any higher without a reboot. That is basically unusable at that point. The PowerMizer Preferred Mode setting does not affect this at all.

For demonstration purposes I wrote a script that records the current GPU frequency and temperature and appends them to a CSV file (attached), and I ran the TW: WH 2 benchmarks in this order:

  1. battle benchmark (avg fps 47.7)
  2. skaven benchmark (avg fps 48.6)
  3. battle benchmark (avg fps 12.7); as can be seen, the effect on performance is huge

The attached CSV file from the script shows a drop in frequency at timestamp 1574211968198, which happened very close to the end of the skaven benchmark. After that the temperature starts going down, but the clock speed never picks up. I will also attach screenshots from the runs and provide the nvidia-bug-report.log.
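For anyone who wants to locate that drop in their own log, here is a minimal sketch. It assumes every line of the CSV is "timestamp in ms, temperature, clock in MHz" in that order, and the function names (parseLog, findStuckPoint) are made up for illustration. It cannot tell an idle GPU from a stuck one, so it only makes sense on a log that ends while the GPU is under load.

```javascript
// Assumption: each log line is "timestampMs,tempC,clockMHz".
function parseLog(csvText) {
  return csvText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .map((line) => {
      const [timestamp, temp, clock] = line.split(",").map(Number);
      return { timestamp, temp, clock };
    })
    // Drop malformed rows (for example stray error text in the log).
    .filter((s) => Number.isFinite(s.timestamp) && Number.isFinite(s.clock));
}

// Returns the first sample of the trailing run in which the clock never
// rises above floorMHz again, or null if the log ends above the floor.
function findStuckPoint(samples, floorMHz = 300) {
  let start = null;
  for (const s of samples) {
    if (s.clock <= floorMHz) {
      if (start === null) start = s;
    } else {
      start = null; // the clock recovered, so forget the earlier dip
    }
  }
  return start;
}

// Usage with made-up numbers (not taken from the attached file):
const demo = parseLog("1000,70,1815\n2000,80,300\n3000,75,300\n");
console.log(findStuckPoint(demo));
```

Running it over the real file would just be a matter of reading the file with fs.readFileSync and passing the text to parseLog.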

I do realize that this might be a compatibility issue with my laptop manufacturer, as the CPU starts to throttle at the same time as the GPU. Yet the CPU recovers normally as soon as the temperatures recover, while the GPU does not recover without a reboot.

Files:

nvidia-bug-report.log.gz https://drive.google.com/open?id=16ErfsUudEoH4dkbsHN0BIarUEE03JFTb
frequency and temperature csv https://drive.google.com/open?id=11uDnWnEw3SW-b6knXNG1ZuiRJBHXH5_L
first benchmark https://drive.google.com/open?id=1_VF2Q5e3ea4F0vZUxj1zCqULdHSXlVlQ
second benchmark https://drive.google.com/open?id=1Zk4r3j9EuMs0tKJ6LSzWIwKsCOlX0Jjr
third benchmark https://drive.google.com/open?id=1RxVZuFE_Gv5wm_mKpnKhswyaQa98vSqv

Hopefully someone here can figure something out. If there are some parameters or settings I could test, I’m more than willing to give them a try!

You should check temperatures of the gpu using nvidia-settings gui or install nvidia-smi and use that. Maybe some bad hysteresis entry in bios is causing this so please check for a system bios update.

Thank you for your reply. The script I wrote uses nvidia-settings to query the temperature and frequency; it just does it in the background and writes the results out as time series data.

I have checked for firmware / BIOS updates and I already have the most current one installed (it came as an OTA update).

https://devtalk.nvidia.com/default/topic/1062020/linux/quadro-t2000-max-q-support/post/5385557/#5385557

Bump maybe?

Since this seems to be an issue with that specific notebook model, did you try to raise an issue with Dell so they could contact nvidia? Furthermore, you could also mail the problem description and nvidia-bug-report.log to linux-bugs[at]nvidia.com

I did post to Dell but they were again totally not helpful.

I think that this is a problem in the NVIDIA driver (and confusion about the Max-Q or non Max-Q variant)

The best possible situation is that the NVIDIA driver does not support temperature management on the T2000 Max-Q and my laptop has that particular chip in it. Or the chip is most probably 100% the same across all the models, and the Max-Q just has different power / temperature management that the Linux driver does not support.

You shouldn’t pay too much attention to the “Max-Q” tag; rather, ignore it. Those are the same chips sharing the same PCI ID, which results in the “T2000 / Max-Q” display; they are just vendor-specific models with a lowered TDP.
Rather assume that you have a regular T2000. You could install nvidia-smi and check if it displays the power budget. https://www.notebookcheck.net/NVIDIA-Quadro-T2000-Max-Q-Graphics-Card.424172.0.html
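Once nvidia-smi is available, something like `nvidia-smi --query-gpu=clocks.gr,temperature.gpu,power.draw,power.limit,clocks_throttle_reasons.active --format=csv,noheader` should print the power budget together with a hex bitmask of the active throttle reasons (on some mobile GPUs the power fields may just read as not available). Below is a rough sketch for decoding that bitmask. The bit values are the NVML nvmlClocksThrottleReason* constants as far as I know them, so please check them against nvml.h for your driver; the function name and the short labels are made up for illustration.

```javascript
// Assumption: bit values follow NVML's nvmlClocksThrottleReason* constants.
// Verify against nvml.h for your driver version before relying on them.
const THROTTLE_REASONS = [
  [0x01, "gpu_idle"],
  [0x02, "applications_clocks_setting"],
  [0x04, "sw_power_cap"],
  [0x08, "hw_slowdown"],
  [0x10, "sync_boost"],
  [0x20, "sw_thermal_slowdown"],
  [0x40, "hw_thermal_slowdown"],
  [0x80, "hw_power_brake_slowdown"],
];

// Takes the hex string printed for clocks_throttle_reasons.active and
// returns the labels of the bits that are set, or null if the field is
// not a hex number (for example when the GPU does not report it).
function decodeThrottleReasons(field) {
  const text = field.trim();
  if (!/^0x[0-9a-f]+$/i.test(text)) return null;
  const mask = BigInt(text);
  return THROTTLE_REASONS
    .filter(([bit]) => (mask & BigInt(bit)) !== 0n)
    .map(([, name]) => name);
}

// Usage with a made-up value:
console.log(decodeThrottleReasons("0x0000000000000024"));
```

If the GPU is stuck at 300MHz while the temperature is already back to normal, the set bits should at least hint at whether the driver thinks it is a thermal or a power limit.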

Thanks for that answer!

In the .csv file I can see clock speeds going up to 1815MHz so definitely not the lowest power budget model.

I will look into installing nvidia-smi, but from what I understood it needs the Xorg stuff installed, and the newest Fedora does not run that.

BTW, did you ever install windows to check if the same issue is not happening there?

Hi,

No I didn’t try it on Windows as I have no interest to run my work PC with that. Fortunately I don’t need the CUDA or anything else taxing the graphics card right at the moment but soonish I may well be running CUDA / OpenCL stuff and at that point I would not like to have it throttling like this.

I have been toying with graphics cards and PCs since the times of GeForce 2 / 3DLabs and I’m pretty sure that I have a driver problem, which is also why I haven’t spent much time on switching OSes.

I will try Windows at some point when I have time to move all my work stuff to a USB HD or install Windows on one.

I will also try to use back channel communication with Dell (and NV) if no-one picks this up.

But thanks for all the help @generix!

Hello,

I have the Dell 5540 with Ubuntu 18.04 and experienced the same problem. I realized that the frequency is limited to 300 MHz only AFTER the execution of a CUDA application. The clock frequency is not limited as long as CUDA is not used. For example, if I run glmark2 right after booting, the clock frequency stays at 1860 MHz.

@genis_valentin are you sure that it stays there, like did you hit the throttle? For me it goes down only once it has hit the 80C mark.

I had this script running in the background to take the readings:

const { exec } = require('child_process');

const TEMP_COMMAND = "nvidia-settings -q ThermalSensorReading";
const CLOCK_COMMAND = "nvidia-settings -q GPUCurrentClockFreqs";
const getIntValue = (out) => Number.parseInt(out.split("):")[1].split(".")[0].trim());

const getFloatValue = (out) => Number.parseFloat(out.split("):")[1].split(".")[0].trim());

function msleep(n) {
    Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, n);
}

function sleep(n) {
    msleep(n*1000);
}

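// Poll the GPU once per second and print "timestamp,temperature,clock" as one CSV row.
const main = async () => {
    while (true) {
        try {
            const temp = getFloatValue(await execute(TEMP_COMMAND));
            const freq = getFloatValue(await execute(CLOCK_COMMAND));
            console.log(`${Date.now()},${temp},${freq}`);
        } catch (err) {
            // Report on stderr so a failed query does not end up in the CSV output.
            console.error(err);
        }
        sleep(1);
    }
}

// Run a shell command and resolve with its stdout.
// Reject on failure so the caller never waits forever on a promise that is never settled.
const execute = (command) => {
    return new Promise((res, rej) => {
        exec(command, (err, stdout, stderr) => {
            if (err) {
                rej(err);
                return;
            }
            if (stderr) console.error(`stderr: ${stderr}`);
            res(stdout);
        });
    });
}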

main();

and I run it as “node script.js >> temp_and_freq.csv”. As soon as it hits 80C (this can be seen in the CSV file from the first post), it throttles to 300MHz, and that is the maximum it will reach after that.

And for @generix: I installed Windows to dual boot and had zero issues there. So it is a Linux-only issue; I haven’t tested with the new drivers yet (440.44).

Hello,
I recently ran into this issue. Is there anything new? I tried driver 440.64, which is only three days old, but I am still stuck at 300MHz.

I have not been able to go beyond 300MHz, even without launching any CUDA applications beforehand.

Hello!

Thanks @sopsaare for the script. I used the script to monitor GPU frequency and I saw that it oscillates between 75 MHz and 300 MHz (for a T2000 card), already from system boot-up. This happens with drivers from the 418, 430, 435 and 440 series. When I reverted to driver 410, which came with the original Ubuntu installation by Dell, the GPU clock frequency oscillates between 300 and 2100 MHz as expected.
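To compare two driver versions, it is enough to look at the lowest and highest clock each log reaches. A minimal sketch, assuming the same "timestamp,temperature,clock" lines that the script prints; the function name clockRange is made up for illustration.

```javascript
// Assumption: the clock in MHz is the third comma-separated field of each line.
function clockRange(csvText) {
  let min = Infinity;
  let max = -Infinity;
  for (const line of csvText.split("\n")) {
    const clock = Number(line.split(",")[2]);
    if (!Number.isFinite(clock)) continue; // skip blank or malformed lines
    if (clock < min) min = clock;
    if (clock > max) max = clock;
  }
  // null means the log contained no usable samples.
  return max === -Infinity ? null : { min, max };
}

// Usage with made-up samples, one log per driver version:
console.log(clockRange("1,45,75\n2,46,300\n"));
console.log(clockRange("1,45,300\n2,60,2100\n"));
```

A log whose maximum never gets above 300 on an otherwise loaded system points at the stuck state described in this thread.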

I attach the bug reports produced with the 410.104 driver and with the 440.59 driver (both shared via Microsoft OneDrive).

@genis_valentin, I can’t thank you enough for this information. Indeed, reverting to the 410.104 driver changes everything. The clock goes up, the graphic benchmarks and tensorflow computations are much better, and even videos on the web are smooth again.

I should have noticed that it was better at the beginning.

Cheers,

B

Good work @genis_valentin.

Now we just need to file a bug with NVIDIA or hope that someone picks this thread up from here. It might also be a problem only with the specific Dell model + specific NVIDIA model, but I highly doubt that.

Unfortunately 410 is already quite an old driver, so it feels bad to revert to it :(

I filed a bug, let’s hope that they solve the issue in future releases! It is a pity to be stuck with 410 as it does not have prime offloading.

https://developer.nvidia.com/nvidia_bug/2884316

Hi All,

Please help to verify with latest Beta driver release and share results.

Hi All,

Please help to verify with latest driver release and share results.