RedHat 7.4 with Tesla P40 * 4 work abnormal with driver 384.81 and 384.125

I have several servers(Dell T630) with 4 Tesla P40 running RedHat 7.4 with kernel 3.10, with nvidia drivers 384.81 or 384.125. One of the servers work abnormal when I install all the driver and tools. It’s power usage is much higher than other servers that is at range of 9~11w, it always at 49w with nothing running on gpus.

# nvidia-smi
Tue Apr  3 21:06:08 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.125                Driver Version: 384.125                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   20C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   22C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:83:00.0 Off |                    0 |
| N/A   18C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   17C    P0    48W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I found it response slowly when typing dmesg, and having warning message saying

[  916.278598] nvidia 0000:02:00.0: irq 212 for MSI/MSI-X
[  916.280934] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.280970] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.280980] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.280990] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.280999] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.281009] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  916.281018] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[  917.071290] nvidia 0000:04:00.0: irq 213 for MSI/MSI-X
[  917.690861] nvidia 0000:83:00.0: irq 214 for MSI/MSI-X
[  918.330270] nvidia 0000:84:00.0: irq 215 for MSI/MSI-X

cuda tookit installed with yum.

Would anyone can help me? Thanks!

nvidia-bug-report.log.gz (280 KB)

Try using the persistence-daemon to keep the gpus initialized.

I found that the card workong at P0 with default setting with nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp                           : Wed Apr  4 10:20:43 2018
Driver Version                      : 384.125

Attached GPUs                       : 4
GPU 00000000:02:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324817049903
    GPU UUID                        : GPU-52ef73fe-e4db-81e4-f83a-7d869ab0930a
    Minor Number                    : 0
    VBIOS Version                   : 86.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2G610-0100-030
    Inforom Version
        Image Version               : G610.0200.00.03
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3810DE
        Bus Id                      : 00000000:02:00.0
        Sub System Id               : 0x11D910DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
    FB Memory Usage
        Total                       : 22912 MiB
        Used                        : 0 MiB
        Free                        : 22912 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 21 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 92 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 49.24 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1303 MHz
        SM                          : 1303 MHz
        Memory                      : 3615 MHz
        Video                       : 1164 MHz
    Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Default Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Max Clocks
        Graphics                    : 1531 MHz
        SM                          : 1531 MHz
        Memory                      : 3615 MHz
        Video                       : 1379 MHz
    Max Customer Boost Clocks
        Graphics                    : 1531 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

then, I Enable persistence mode with nvidia-smi -pm 1 or nvidia-persistenced --persistence-mode and the server recovers

==============NVSMI LOG==============

Timestamp                           : Wed Apr  4 10:46:30 2018
Driver Version                      : 384.125

Attached GPUs                       : 4
GPU 00000000:02:00.0
    Product Name                    : Tesla P40
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0324817049903
    GPU UUID                        : GPU-52ef73fe-e4db-81e4-f83a-7d869ab0930a
    Minor Number                    : 0
    VBIOS Version                   : 86.02.23.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : 900-2G610-0100-030
    Inforom Version
        Image Version               : G610.0200.00.03
        OEM Object                  : 1.1
        ECC Object                  : 4.1
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B3810DE
        Bus Id                      : 00000000:02:00.0
        Sub System Id               : 0x11D910DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
    FB Memory Usage
        Total                       : 22912 MiB
        Used                        : 0 MiB
        Free                        : 22912 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 19 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 92 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 11.30 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 544 MHz
        SM                          : 544 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Default Applications Clocks
        Graphics                    : 1303 MHz
        Memory                      : 3615 MHz
    Max Clocks
        Graphics                    : 1531 MHz
        SM                          : 1531 MHz
        Memory                      : 3615 MHz
        Video                       : 1379 MHz
    Max Customer Boost Clocks
        Graphics                    : 1531 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

I wonder how would this happened? I always use the default setting and it always works.

If you’re running nothing on the gpus and not use the persistence-daemon, the driver will de-initialize. If you’re then running nvidia-smi, it will initialize the gpus again which puts them to P0 state at start.
Using driver persistence, the gpus will stay initialized so when you run nvidia-smi, they stay idle.

This server is the only one that meets the problem of latency, others using cuda-8.0 with 384.66 or cuda-9.0 with 384.81 works well. As u can see that other machine without persistence mode work on P8 and the Idel state is active.

In addition, Before I installing cuda9.0 with 384.66, I installed the cuda9.1 using yum and found it cannot work with tensorflow1.7. Then I uninstall it using yum history undo. May this strange problem is associated with it?Do I need to flush the driver like xorg or other things mannually

I can only tell you that this is the expected behaviour and the reason for persistence mode/daemon. Not having logs from the other systems I can’t tell why those are behaving differently.
For TF and Cuda 9.1, minimum required driver version is 387 and you’ll have to build TF from source, the binary is built against Cuda 9.0.
Edit: build instructions https://medium.com/@xinh3ng/install-cuda-9-1-and-cudnn-7-for-tensorflow-1-5-0-cda36239bc68