Hi all,
I have 8x HPE Proliant DL380 Gen10 Plus servers with latest BIOS. Each server has 3x Nvidia A40 cards. The cards are all in compute / Displayless mode for vGPU. I have the latest Nvidia drivers 510 / version 14.1. Hypervisor is Citrix with latest version 8.2.1 (CU1). Everything is supported regarding HCL lists. I have in BIOS “Virtualization - Max performance” which enables SR-IOV and MMIO. I can attach a vgpu profile to my VM (W10 Enterprise 21H2) and the issue ocours when I am installing corresponding Nvidia driver. The blue screen appears with VIDEO_TDR_FAILURE , nvlddmkm.sys.
I have tested a clean server 2019 as well. I think I have tried it all. Please help!!
±----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06 Driver Version: 510.73.06 CUDA Version: N/A |
|-------------------------------±---------------------±---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:2B:00.0 Off | Off |
| 0% 44C P8 35W / 300W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+
| 1 NVIDIA A40 On | 00000000:A2:00.0 Off | Off |
| 0% 44C P8 34W / 300W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+
| 2 NVIDIA A40 On | 00000000:C0:00.0 Off | Off |
| 0% 45C P8 37W / 300W | 0MiB / 49140MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+
±----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
±----------------------------------------------------------------------------+
==============NVSMI LOG==============
Timestamp : Wed Jun 1 13:58:39 2022
Driver Version : 510.73.06
CUDA Version : Not Found
Attached GPUs : 3
GPU 00000000:2B:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1565221004370
GPU UUID : GPU-05bde6b5-ed98-02a1-5ca3-b0f29aa9713a
Minor Number : 0
VBIOS Version : 94.02.5C.00.03
MultiGPU Board : No
Board ID : 0x2b00
GPU Part Number : 900-2G133-0300-030
Module ID : 0
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : SR-IOV
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x2B
Device : 0x00
Domain : 0x0000
Device Id : 0x223510DE
Bus Id : 00000000:2B:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 49140 MiB
Reserved : 452 MiB
Used : 0 MiB
Free : 48687 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 44 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 35.51 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 706.250 mV
Processes : None