Thank you, @prodigit80!
> Let Nvidia handle the voltage.
So, would you say that it is an oversight on the part of NVIDIA that Windows users can modify the frequency-voltage curve with MSI Afterburner (as shown in the first link above)?
An example of a frequency-voltage curve would be:
My understanding is that the GPU works according to such a curve.
When the GPU performs a minor task, it can run at a lower frequency and voltage; when it has to do something more computationally intensive, it will run at a higher frequency, requiring a higher voltage.
If it is possible to modify this curve on Windows, why shouldn’t it be possible to modify it on Linux?
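For what it's worth, this is how I try to inspect that behaviour on Linux; a minimal sketch, assuming GPU index 0 and a reasonably recent driver (the voltage readout is not available on older ones):

```bash
# Current, application and max clocks for GPU 0
nvidia-smi -i 0 -q -d CLOCK

# Current core voltage (only reported on newer drivers)
nvidia-smi -i 0 -q -d VOLTAGE
```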
> The problem with voltage is that if you even have a 0.01 V too low setting for a split second, your GPU driver could crash.
Those making changes to the NVIDIA settings, be it overclocking or changing the power limit, are advised to stress-test. On Linux, this can be done with a range of tools such as Unigine Valley/Heaven/Superposition, or games (see the monitoring sketch after the questions below). Here I have a few questions:
- What amount of stress-testing is necessary to make sure that the GPU is operating without errors? Can all crashes be detected by eye, or, to put it differently: are there anomalies that may go undetected?
- I use the NVIDIA card mostly for CUDA computations. Should I be concerned that there might be some faulty results if I do not stress-test thoroughly, i.e. might it happen that I will not get any errors/warnings but get wrong results? Or should I expect that if something is wrong with my changes the GPU will simply stop working and the operations will not be carried through?
- Are power/GPU clock changes hardware-safe, i.e. can they damage the GPU?
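For reference, the minimal monitoring I have in mind during such stress tests looks like this; a sketch, assuming GPU index 0 (Xid messages in the kernel log are the driver's error reports; note that consumer cards have no ECC, so truly silent computation errors would not show up here):

```bash
# Log clocks, power draw and temperature once per second while the stress test runs
nvidia-smi -i 0 --query-gpu=timestamp,clocks.sm,power.draw,temperature.gpu \
    --format=csv -l 1 > gpu_log.csv &

# Afterwards, check the kernel log for NVIDIA Xid error reports
sudo dmesg | grep -i xid
```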
> Setting the power limit, and overclocking, will allow the Nvidia driver to regulate the voltage by itself.
- If it self-regulates, why does the GPU still crash when I overclock (e.g. by 400 MHz)? Are the crashes the result of power/overclock changes per se or because of the new voltage levels chosen during self-regulation?
- As it regulates itself, does this mean that a positive GPU clock offset will result in more voltage being drawn?
- Similarly, does a decrease in power limit result in less voltage being drawn?
As far as I know, power is computed as follows: P (W) = I (A) × V (V)
So, does reducing the max power limit mean that at each frequency the GPU runs on less power? And if so, is it the voltage, V, that is being reduced? Would this be a good way to undervolt?
Or does it mean that the GPU will run only up to the frequency/voltage level corresponding to that max power?
Taking the picture above as our example again: if the GPU consumes 120 W to run at 1900 MHz (1030 mV), and we reduce the limit to 115 W, then, assuming the current stays the same, the max voltage (corresponding to the max power setting) at which the GPU will run will be V = 115 / (120 / 1.03) ≈ 0.987 V = 987 mV, corresponding to ~1870 MHz. So, if we reduce the power limit to 115 W, the max attainable frequency would be 1870 MHz, right?
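For concreteness, here is how I would try this on Linux; a sketch, assuming GPU index 0 (the -pl value must lie within the min/max limits that nvidia-smi reports, and setting it requires root):

```bash
# Show the current, default, min and max power limits
nvidia-smi -i 0 -q -d POWER

# Lower the board power limit to 115 W
sudo nvidia-smi -i 0 -pl 115

# Watch which clocks the GPU then sustains under load
nvidia-smi -i 0 --query-gpu=clocks.sm,power.draw --format=csv -l 1
```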
- What does the GPU clock offset actually do? Does it move the entire curve by that offset?
Assume there is a stock max frequency (i.e. the hard-coded max frequency at which the GPU can run) of 1900 MHz and the min frequency on the curve is 1350 MHz (just as in the first picture). If I successfully set an overclock offset of 550 MHz, am I right to think that my GPU will always operate at 1900 MHz consuming 700 mV?
> Of course, the benefits of overclocking gradually towards lower voltages allow for better stress-testing.
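For reference, this is how such an offset is applied step by step with nvidia-settings; a sketch, assuming GPU index 0 and performance level 3 (the level index varies between GPUs, and Coolbits must already be enabled):

```bash
# List the performance levels and their clock ranges
nvidia-settings -q '[gpu:0]/GPUPerfModes' -t

# Apply a +100 MHz graphics clock offset to the highest performance level (here: 3)
nvidia-settings -a '[gpu:0]/GPUGraphicsClockOffset[3]=100'
```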
> `sudo nvidia-xconfig --cool-bits=24`
- Did you mean to write 28? (As far as I understand, Coolbits is a bitmask: 4 enables manual fan control, 8 enables clock offsets, and 16 enables overvoltage, so 24 = 8 + 16, while 28 = 4 + 8 + 16.)
> If you set this value too high, or too low, the driver will tell you in the terminal what the limits of that GPU are.
- Are there any risks associated with setting the power limit at its min/max value?
> If you have more than one GPU, you'd have to do the command:
>
> `sudo nvidia-xconfig --enable-all-gpus`
>
> before you do:
>
> `sudo nvidia-xconfig --cool-bits=X`
>
> This `--enable-all-gpus` command may break your desktop experience, as many Linux versions need special installation commands (only working on 18.04 or before) to run multiple GPUs.
I have an Intel iGPU and an Nvidia GTX 1050 Ti (Max-Q) dGPU on my laptop. I do not experience desktop troubles. What I experience is that I cannot reboot after I run `sudo nvidia-xconfig`. I have posted about this here: nvidia-xconfig breaks boot.
Among the identified/suggested solutions are:

Run `sudo nvidia-xconfig` to create the /etc/X11/xorg.conf file, then open the file, navigate to the Section "Device", and:
- Comment out `Option "Coolbits" "28"` (for some reason, running `sudo nvidia-xconfig --cool-bits=28` adds `Option "Coolbits" "28"` to Section "Screen" and not to "Device").
Alternatively, as suggested here, create a myxorg.conf file with the following content:

```
Section "Device"
    Identifier "Nvidia Prime"
    Option "Coolbits" "28"
EndSection
```

From the author:

> I believe setting the option that way applies it on top of the autogenerated Xorg configuration, so there should be no conflict with what Ubuntu does behind the scenes.
This solution has been criticized in the comments:

> The problem with that myxorg.conf file you tried is that you use a "Device" section. That's forcing a certain setup. If there's several Device sections, they clash with each other and things break.

Instead, the commenter suggested something like:

```
Screen 0 "nvidia"
Option "Coolbits" "12"
```

that is, a myxorg.conf file in /etc/X11/xorg.conf.d/ with the following content:

```
Identifier "my nvidia settings"
Option "Coolbits" "12"
```
Are the steps you suggest (download .run file drivers, uninstall .deb files, reinstall) aimed at preventing the issue I have mentioned above from happening?
Would you recommend any of the solutions outlined above? If not, what is wrong with them?
Since on Ubuntu 18.04 I cannot use both GPUs at the same time, i.e. I have to `prime-select` one, why do I bump into this issue?