Also, I have access to one PC which has this problem, so if possible I can show the problem via an SSH console.
OK, I have 2 computers on which this is working badly.
I prepared bug reports.
http://paste.ubuntu.com/p/dw4dN8kt2Q/
nvidia-smi takes about 12 seconds
adjusting all fans takes like 4+ minutes :)
root@simpleminer:/home/miner# time nvidia-smi
Mon Sep 10 15:44:33 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... On | 00000000:01:00.0 Off | N/A |
| 84% 72C P2 171W / 225W | 197MiB / 8119MiB | 68% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 107... On | 00000000:02:00.0 Off | N/A |
| 86% 70C P2 151W / 170W | 197MiB / 8119MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 107... On | 00000000:03:00.0 Off | N/A |
| 84% 70C P2 198W / 180W | 197MiB / 8119MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 1080 On | 00000000:04:00.0 Off | N/A |
| 95% 71C P2 165W / 210W | 265MiB / 8119MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 1080 On | 00000000:05:00.0 Off | N/A |
| 99% 70C P2 164W / 210W | 265MiB / 8119MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 1080 On | 00000000:06:00.0 Off | N/A |
| 80% 70C P2 145W / 190W | 265MiB / 8119MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 1080 On | 00000000:09:00.0 Off | N/A |
| 97% 70C P2 164W / 210W | 265MiB / 8119MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 1080 On | 00000000:0A:00.0 Off | N/A |
| 80% 64C P2 160W / 190W | 265MiB / 8119MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1392 G /usr/lib/xorg/Xorg 5MiB |
| 0 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 179MiB |
| 1 1392 G /usr/lib/xorg/Xorg 5MiB |
| 1 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 179MiB |
| 2 1392 G /usr/lib/xorg/Xorg 5MiB |
| 2 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 179MiB |
| 3 1392 G /usr/lib/xorg/Xorg 5MiB |
| 3 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 247MiB |
| 4 1392 G /usr/lib/xorg/Xorg 5MiB |
| 4 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 247MiB |
| 5 1392 G /usr/lib/xorg/Xorg 5MiB |
| 5 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 247MiB |
| 6 1392 G /usr/lib/xorg/Xorg 5MiB |
| 6 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 247MiB |
| 7 1392 G /usr/lib/xorg/Xorg 5MiB |
| 7 3406 C /root/miner/z-enemy-v1.18-cuda9.2/z-enemy 247MiB |
+-----------------------------------------------------------------------------+
real 0m12.736s
user 0m0.000s
sys 0m0.791s
nvidia-smi (takes only 0.5 seconds)
changing fan speed: real 1m27.233s
http://paste.ubuntu.com/p/rDT9dBRN8K/
root@simpleminer:/home/miner# time nvidia-smi
Mon Sep 10 15:51:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... On | 00000000:01:00.0 Off | N/A |
| 66% 65C P2 77W / 83W | 153MiB / 6078MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106... On | 00000000:02:00.0 Off | N/A |
| 58% 63C P2 62W / 83W | 153MiB / 6078MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 106... On | 00000000:03:00.0 Off | N/A |
| 57% 65C P2 91W / 90W | 153MiB / 6078MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 106... On | 00000000:04:00.0 Off | N/A |
| 62% 66C P2 84W / 90W | 153MiB / 6078MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 106... On | 00000000:06:00.0 Off | N/A |
| 69% 65C P2 80W / 83W | 153MiB / 6078MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 106... On | 00000000:07:00.0 Off | N/A |
| 67% 65C P2 80W / 83W | 153MiB / 6078MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 106... On | 00000000:08:00.0 Off | N/A |
| 64% 65C P2 70W / 83W | 153MiB / 6078MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 106... On | 00000000:09:00.0 Off | N/A |
| 53% 65C P2 76W / 83W | 153MiB / 6078MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1525 G /usr/lib/xorg/Xorg 5MiB |
| 0 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 1 1525 G /usr/lib/xorg/Xorg 5MiB |
| 1 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 2 1525 G /usr/lib/xorg/Xorg 5MiB |
| 2 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 3 1525 G /usr/lib/xorg/Xorg 5MiB |
| 3 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 4 1525 G /usr/lib/xorg/Xorg 5MiB |
| 4 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 5 1525 G /usr/lib/xorg/Xorg 5MiB |
| 5 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 6 1525 G /usr/lib/xorg/Xorg 5MiB |
| 6 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
| 7 1525 G /usr/lib/xorg/Xorg 5MiB |
| 7 31397 C /root/miner/z-enemy-v1.18-cuda9.1/z-enemy 135MiB |
+-----------------------------------------------------------------------------+
real 0m0.457s
user 0m0.004s
sys 0m0.149s
OK i made some improvment in my fanspeed script.
It do not executes separate command to each gpu but it executes one nvidia-settings with multiple -a parameters.
Now it works much faster but still from my tests it seems that newer nvidia drivers are significantly slower in nvidia-settings/nvidia-smi commands.
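For reference, a minimal sketch of how such a batched call can be built, assuming each card exposes a single fan so that fan:i pairs with gpu:i (GPU count and target speed are illustrative):

# Build one nvidia-settings invocation for all GPUs instead of one process per GPU.
NUM_GPUS=$(nvidia-smi -L | wc -l)   # nvidia-smi -L prints one line per GPU
SPEED=70                            # target fan speed in percent
ARGS=()
for ((i = 0; i < NUM_GPUS; i++)); do
    ARGS+=(-a "[gpu:$i]/GPUFanControlState=1" -a "[fan:$i]/GPUTargetFanSpeed=$SPEED")
done
DISPLAY=:0 nvidia-settings "${ARGS[@]}"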
Is there maybe something we could adjust in the kernel to communicate with nvidia GPUs faster?
Maybe lowering CUDA compute priority relative to nvidia-smi requests?
I even have some PCs on which nvidia-smi takes:
real 1m46.630s
This also occurs with the newest nvidia drivers (410.57) and the 4.15.2 kernel :(
Also, executing this command on some computers takes 10+ minutes, which is totally unacceptable :)
DISPLAY=:0 nvidia-settings \
  -a [gpu:0]/GPUFanControlState=1 -a [fan:0]/GPUTargetFanSpeed=70 \
  -a [gpu:1]/GPUFanControlState=1 -a [fan:1]/GPUTargetFanSpeed=70 \
  -a [gpu:2]/GPUFanControlState=1 -a [fan:2]/GPUTargetFanSpeed=70 \
  -a [gpu:3]/GPUFanControlState=1 -a [fan:3]/GPUTargetFanSpeed=70 \
  -a [gpu:4]/GPUFanControlState=1 -a [fan:4]/GPUTargetFanSpeed=70 \
  -a [gpu:5]/GPUFanControlState=1 -a [fan:5]/GPUTargetFanSpeed=70 \
  -a [gpu:6]/GPUFanControlState=1 -a [fan:6]/GPUTargetFanSpeed=70 \
  -a [gpu:7]/GPUFanControlState=1 -a [fan:7]/GPUTargetFanSpeed=70 \
  -a [gpu:8]/GPUFanControlState=1 -a [fan:8]/GPUTargetFanSpeed=70 \
  -a [gpu:9]/GPUFanControlState=1 -a [fan:9]/GPUTargetFanSpeed=78 \
  -a [gpu:10]/GPUFanControlState=1 -a [fan:10]/GPUTargetFanSpeed=70 \
  -a [gpu:11]/GPUFanControlState=1 -a [fan:11]/GPUTargetFanSpeed=70
OK guys.
Thanks to user (filemissing) I was able to confirm what is causing this problem.
The problem gets bigger and bigger the lower we set the power limit.
Here are my benchmarks on a 4xx driver (though it does not matter which driver it is):
Setup is 12x P104 GPUs.
powerlimit set to 180 watts (acceptable):
nvidia-smi: about 1 second
nvidia-settings command that changes fan speed on all GPUs: about 8 seconds
powerlimit set to 150 watts (problem shows its signs):
nvidia-smi: about 4 seconds
nvidia-settings command that changes fan speed on all GPUs: about 90 seconds
powerlimit set to 200 watts (acceptable):
nvidia-smi: about 1 second
nvidia-settings command that changes fan speed on all GPUs: 7-16 seconds
powerlimit set to 120 watts (problem is so big that it is dangerous!):
nvidia-smi: 120++++ seconds
nvidia-settings command that changes fan speed on all GPUs: half an hour or more :)
Of course, NOT using a power limit is no solution, as we want to lower the power limit on the cards to keep the same performance at lower wattage. This is a global problem that most mining users are hitting.
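For anyone who wants to reproduce the measurements, a rough timing sketch, assuming root, a running compute load, and that nvidia-smi -pl without -i applies the limit to every GPU (the wattage values are just the ones from my tests):

# Time nvidia-smi at several power limits while the miner is running.
for PL in 200 180 150 120; do
    nvidia-smi -pl "$PL" > /dev/null   # set the power limit in watts
    sleep 30                           # let the cards settle under the new cap
    echo "powerlimit ${PL}W:"
    time nvidia-smi > /dev/null        # measure the query latency
done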
I reported this to nvidia, since no one is answering here :(
Bug reported via email: Bug id 2415717 - Performance issue [Incident: 181005-000056]
We have already lost like 30 hours on this subject.
Are there any updates on this subject?
I have contacted the nvidia dev team in two different ways, but they do not give a …
Well, since then a few new driver versions have been released; I am not sure whether this has fixed itself.
You can try our image from simplemining.net
This seems to be working, and the load average is around 2.5 while it is mining (computing).
SM-5.0.21-3e-a19.30-n430.64-v1255.img.xz
Hi tytanick,
Apologies for late reply.
I will try to replicate the issue locally so that the dev team can investigate it further.
But I will need repro steps and an nvidia bug report for it.
Please also provide the executable code/application which triggers the problem.
It would be good to know the minimum number of GPUs needed to repro the issue.
Also share the kernel config file if you made any changes during kernel compilation.
We are tracking this issue in bug 2415717 [internal]. Please provide the information needed to reproduce this issue.
We have run fresh tests on the same setup as before, and here are the results:
Kernel 5.0.21-4, driver nv430.40:
- 120 W power limit - super fast responses
- 100 W power limit - fast responses
- 90 and 80 W power limit - responses are slightly slower, but still all OK
Kernel 4.17.19-14, driver nv418.43:
- very slow responses, totally unusable, and 25+ system load
Kernel 4.17.19-17, nv430.40:
- responses work nicely
- at 80 W power limit it is a little slower, but still all fine
It seems that the 430.40 drivers solved this problem!
How many GPUs?
Which GPU models?
Which CPU?
How much RAM?
I tested 435 (not 430) drivers (almost all 435 versions) and they are indeed lighter on the CPU but gave measurable and consistent 2-3% performance loss on all compute tasks on GTX1070, 1070Ti, 1080Ti … that’s not a solution for me sadly. I keep the GTX1070 and 1070Ti at around 100W limit.
Would you have time to test drivers 415.27 with the 5.0 kernel and compare the actual benchmark you get with 430.40 and 5.0 kernel?
p.s. I don’t think the kernel makes a difference.
That is true, the newest drivers are hashing 2-3% slower.
The kernel did not improve anything; only the driver “fixed this”, by slowing mining down?
So I guess the case is still not solved, as the latest drivers work slower.
Our platform:
G4400
13x Asus P104-100 4GB
H110 Pro BTC
8 GB RAM
Thanks for the experiments. Please provide the detailed information requested in comment #31 so we can replicate the issue locally for debugging.
The best way to do that, in my opinion, would be to share access to my 13-GPU P104 computer on which this problem exists.
Basically you need a 13x P104 computer running, for example, kernel 4.17.19-14 with driver nv418.43 while the mining program (computing process) is running.
Can we do it that way? I can send SSH details for a machine on which this is all set up.
You could do anything you want on that PC and test as much as you like.
But we would need to do this over email for security reasons.
Please tell me whether that is a good idea or not.
@tytanick That’s a good idea and a generous offer
@amrits It would be great if you took up @tytanick’s offer.
To be honest, you can very easily see on any 1070 or 1070 Ti that the 435 drivers are consistently 2-3% slower in raw GPU performance across all GPU compute apps, while at the same time being lighter on the CPU; this is an undesired compromise. You should be able to reproduce that without any reports from us, as it is immediately visible, but it would be great if you took up @tytanick’s offer.
I can add one more thing.
I am the CEO of simplemining.net and we have multiple setups in our lab and thousands of clients using nvidia (amd too).
So we can easily spot bad things like that.
So there are a few problems right now with the nvidia driver:
- nvidia-smi being very slow, mainly with the P104, though it happens on some other GPUs too.
- The newest drivers are indeed much slower.
Because of those two things, and also because we have many images ready to be tested, I can offer the following:
SSH to a machine, and then I can give you access to our panel to see how the computing speed is affected.
I can also give you one command which will automatically reflash the OS you are testing to a different kernel/nvidia driver version.
This way you will have a very easy, 100% complete environment: run tests on one kernel+driver combination, then reflash with one command to a different image with a different kernel and nvidia driver version.
This would allow the nvidia team to test and see for themselves what is going on.
On top of that, there is one more issue with the nvidia driver I would like to see solved, as it affects many people mining on GPU cards.
3. The 2xxx GPU series has a bug/problem with fans.
Previous GPU generations exposed ONE software fan per ONE GPU card.
Now the 2xxx series (some models, and I think the 1660 too) shows the strange behavior that nvidia-settings sees TWO fans per GPU on most cards.
The problem is that we use our own fan speed management script, which adjusts the fan speed on every GPU the way the user wants, because the default behavior is just terrible. Anyway, in a setup like 1060, 1060, 2060, 2060, 2060, nvidia tells us, for example, that there are 8 software GPU fans in total, and we have no f…ng idea which GPU has how many software fans…
So we see only 8 fans for 5 GPUs and do not know which GPU has how many; therefore we cannot control the fan speeds, as we do not know whether fan0 and fan1 belong to one card or two, or maybe fan0 is for GPU0 and then fan1 and fan2 are for GPU1…
As this bug has existed for many days, we decided to turn off our fan speed management script on systems where we see a strange number of GPU fans… If there are X GPUs and we see X fans or 2X fans, we know how to proceed, but with any other number, like 5 GPUs and 9 fans, our script bails out (see the sketch below).
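A minimal sketch of that sanity check, assuming nvidia-settings -q fans prints one line containing [fan:N] per fan:

# Only touch fan control when the fan count is exactly X or 2X for X GPUs.
GPUS=$(nvidia-smi -L | wc -l)
FANS=$(DISPLAY=:0 nvidia-settings -q fans | grep -c '\[fan:')
if [ "$FANS" -eq "$GPUS" ] || [ "$FANS" -eq $((2 * GPUS)) ]; then
    echo "fan layout looks sane ($FANS fans / $GPUS GPUs), applying fan speeds"
else
    echo "ambiguous fan layout ($FANS fans / $GPUS GPUs), skipping fan control"
fi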
The second part of the problem is that nvidia-smi also does not know which GPU has which fan :)
For example, we see things like this in nvidia-smi output:
GPU1 80% fan
GPU2 0% fan
GPU3 0% fan
GPU4 0% fan
GPU5 33% fan
This is because if a card has two fans, one set to 0% and the other to 50%, then nvidia-smi may report either 0% or 50% … it seems random …
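One way to inspect the real per-fan speeds is to query each fan target directly instead of trusting nvidia-smi's single fan column, assuming the read-only GPUCurrentFanSpeed attribute is available on fan targets:

# Print the current speed of every fan target individually.
FANS=$(DISPLAY=:0 nvidia-settings -q fans | grep -c '\[fan:')
for ((f = 0; f < FANS; f++)); do
    DISPLAY=:0 nvidia-settings -q "[fan:$f]/GPUCurrentFanSpeed"
done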
This is a bug for sure, and I would appreciate getting these things fixed.
The last time I contacted nvidia support, they ignored me.