CUDA 11.2 on Windows: Utilization decreased over time, throttling TensorFlow jobs

bart.trzy · March 25, 2021, 1:13am

Hi,

I’ve been having a persistent problem across multiple Windows machines with CUDA 11.2 and an RTX 3090: performance in TensorFlow seems to degrade after some time and never recovers until I relaunch the app. It almost looks like some sort of intentional performance throttling.

For example, when training a Keras model, after a couple of epochs, CUDA utilization, as measured by the Windows Task Manager’s Performance tab, drops from 90-100% to 30% and hovers there indefinitely. As a result, training iterations begin to take twice as long or more to complete.
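
To cross-check the Task Manager reading, the drop could also be logged with nvidia-smi while training runs. The following is only a rough sketch, not something taken from the original measurements: it assumes nvidia-smi is on the PATH and that the GPU is index 0, and the helper names (parse_sample, read_gpu, log_gpu) are made up for illustration.

```python
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["utilization.gpu", "clocks.sm", "temperature.gpu", "power.draw"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict of floats (None if not numeric)."""
    values = [v.strip() for v in line.split(",")]
    sample = {}
    for name, value in zip(FIELDS, values):
        try:
            sample[name] = float(value)
        except ValueError:
            sample[name] = None  # e.g. "[N/A]" when the driver does not report it
    return sample


def read_gpu(index=0):
    """Run nvidia-smi once and return a parsed sample for the given GPU index (assumed 0)."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_sample(out.strip().splitlines()[0])


def log_gpu(samples=60, interval=5.0, index=0):
    """Print a timestamped sample every `interval` seconds, `samples` times."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_gpu(index))
        time.sleep(interval)
```

Calling log_gpu() from a second terminal during training would show whether utilization.gpu and clocks.sm fall at the same moment the iterations slow down, and whether temperature.gpu is anywhere near a limit at that point.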

I posted about this months ago on the TF forums and Stack Overflow, but no one has been able to confirm it. This has now happened across two different machines (albeit with the same GPU). This is not thermal throttling, and under Ubuntu performance is sustained.

Any ideas? I would really like to be able to train on Windows as it’s also my primary dev platform for other non-ML projects.

Thank you,

Bart