Nvidia A40 + Citrix Hypervisor 8.2 CU1 - Bluescreen

Hi all,
I have 8x HPE ProLiant DL380 Gen10 Plus servers with the latest BIOS. Each server has 3x Nvidia A40 cards, all in compute / displayless mode for vGPU. I have the latest Nvidia drivers (510 branch, vGPU 14.1). The hypervisor is Citrix Hypervisor at the latest version, 8.2.1 (CU1). Everything is supported according to the HCLs. In the BIOS I have “Virtualization - Max performance” set, which enables SR-IOV and MMIO. I can attach a vGPU profile to my VM (W10 Enterprise 21H2); the issue occurs when I install the corresponding Nvidia guest driver. A blue screen appears with VIDEO_TDR_FAILURE, nvlddmkm.sys.
I have tested a clean Server 2019 as well. I think I have tried it all. Please help!!

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06    Driver Version: 510.73.06    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:2B:00.0 Off |                  Off |
|  0%   44C    P8    35W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:A2:00.0 Off |                  Off |
|  0%   44C    P8    34W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:C0:00.0 Off |                  Off |
|  0%   45C    P8    37W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

==============NVSMI LOG==============

Timestamp : Wed Jun 1 13:58:39 2022
Driver Version : 510.73.06
CUDA Version : Not Found

Attached GPUs : 3
GPU 00000000:2B:00.0
    Product Name : NVIDIA A40
    Product Brand : NVIDIA
    Product Architecture : Ampere
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Enabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1565221004370
    GPU UUID : GPU-05bde6b5-ed98-02a1-5ca3-b0f29aa9713a
    Minor Number : 0
    VBIOS Version : 94.02.5C.00.03
    MultiGPU Board : No
    Board ID : 0x2b00
    GPU Part Number : 900-2G133-0300-030
    Module ID : 0
    Inforom Version
        Image Version : G133.0200.00.05
        OEM Object : 2.0
        ECC Object : 6.16
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : Host VGPU
        Host VGPU Mode : SR-IOV
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x2B
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x223510DE
        Bus Id : 00000000:2B:00.0
        Sub System Id : 0x145A10DE
        GPU Link Info
            PCIe Generation
                Max : 4
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 0 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 49140 MiB
        Reserved : 452 MiB
        Used : 0 MiB
        Free : 48687 MiB
    BAR1 Memory Usage
        Total : 65536 MiB
        Used : 1 MiB
        Free : 65535 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Disabled
        Pending : Disabled
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows
        Correctable Error : 0
        Uncorrectable Error : 0
        Pending : No
        Remapping Failure Occurred : No
        Bank Remap Availability Histogram
            Max : 192 bank(s)
            High : 0 bank(s)
            Partial : 0 bank(s)
            Low : 0 bank(s)
            None : 0 bank(s)
    Temperature
        GPU Current Temp : 44 C
        GPU Shutdown Temp : 98 C
        GPU Slowdown Temp : 95 C
        GPU Max Operating Temp : 88 C
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 35.51 W
        Power Limit : 300.00 W
        Default Power Limit : 300.00 W
        Enforced Power Limit : 300.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 210 MHz
        SM : 210 MHz
        Memory : 405 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : 1740 MHz
        Memory : 7251 MHz
    Default Applications Clocks
        Graphics : 1740 MHz
        Memory : 7251 MHz
    Max Clocks
        Graphics : 1740 MHz
        SM : 1740 MHz
        Memory : 7251 MHz
        Video : 1530 MHz
    Max Customer Boost Clocks
        Graphics : 1740 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 706.250 mV
    Processes : None

@sschaber, you always have great tips :)

Have you also tried an older guest driver from the vGPU 13.x branch, just to rule out the latest guest driver? I’m not aware of any known issues with the vGPU 14.1 driver, but I haven’t tested it myself yet.

regards
Simon

Thank you for the reply. I have tested 13.3 on both host and guest. Are there any other settings etc. you are aware of? Could try an earlier to…

@sschaber I tried 13.0 now and it made no difference… I am out of ideas. Same thing when passing through the whole GPU …

Which profile did you assign during the driver installation?

A40-8Q… For CAD machines.

Sounds like it is more related to GPOs or the Windows OS. Is the VM joined to a domain?
I once had a case where a security setting in a GPO stopped our drivers from installing properly.

Yes, it is a provisioned Citrix W10 machine. To rule that out, I also installed a “local” Server 2019 with no domain. Same issue.

Hi, another idea would be to modify the TdrDelay setting to see if the driver installs when it is given more time (10 sec instead of 2 sec).
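
For anyone repeating this test: TdrDelay is a REG_DWORD value, in seconds, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers in the Windows guest (the default is 2). Below is a minimal sketch that prints a .reg file for it. The function name make_tdr_reg and the file name tdr-delay.reg are made up for illustration, and it assumes the guest's regedit accepts a plain ANSI .reg file with CRLF line endings.

```shell
#!/bin/sh
# Sketch only: print a .reg file that sets TdrDelay to the given number of
# seconds. make_tdr_reg is a hypothetical helper name, not an NVIDIA or
# Microsoft tool.
make_tdr_reg() {
    # $1 = delay in seconds (decimal); written as an 8-digit hex DWORD
    printf 'Windows Registry Editor Version 5.00\r\n\r\n'
    printf '[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers]\r\n'
    printf '"TdrDelay"=dword:%08x\r\n' "$1"
}

# Default to the 10 seconds suggested in the thread.
make_tdr_reg "${1:-10}"
```

Redirect the output to a file (for example tdr-delay.reg), import it inside the guest, and reboot the guest; the new delay is normally only picked up after a restart.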

Perfect, I will try that next.

Made no difference. Any other tips?

@sschaber here is the daemon log on the hypervisor:

Jun 2 09:46:21 xen01 vgpu-9[13537]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Jun 2 09:46:21 xen01 vgpu-9[13537]: notice: vmiop_log: Driver Version: 512.78
Jun 2 09:46:21 xen01 vgpu-9[13537]: notice: vmiop_log: vGPU version: 0xd0001
Jun 2 09:46:21 xen01 vgpu-9[13537]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jun 2 09:46:21 xen01 vgpu-9[13537]: notice: vmiop_log: (0x0): Timeout detection and recovery (TDR) completed.
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): Timeout occurred, reset initiated.
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x52445456 0x004403f8 0x000001cc 0x00000001
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00989680 0x00000000 0x000001bb 0x0000000f
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000100 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001600 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00001a00 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001f00 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000b00 0x00000000 0x0000000a 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000c00 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000a00 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001300 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00002100 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001700 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00002400 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00001800 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000e00 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000f00 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00001000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x000001aa 0x0000000b
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x0000000a 0x0000000a 0x0000000a 0x00020b01
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00005188 0x00000000 0x1b43a113 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x1d0f9bb1 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000009 0x00000009
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000009 0x00020b01 0x000000dc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x1b439cf5 0x00000000 0x1d0f9bb1 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000008 0x00000008 0x00000008 0x00020b00
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00005188 0x00000000 0x1813f8f1 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x1b12ad84 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000007 0x00000007
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000007 0x00020b00 0x000000dc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x1813f4ee 0x00000000 0x1b12ad83 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000006 0x00000006 0x00000006 0x00020b00
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00005188 0x00000000 0x14e487a5 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x17e3169c 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000005 0x00000005
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000005 0x00020b00 0x000000dc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x14e483a6 0x00000000 0x17e3169c 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000004 0x00000004 0x00000004 0x00020b00
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00005188 0x00000000 0x11a832a9 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x14b307b7 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000003 0x00000003
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000003 0x00020b00 0x000000dc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x11a82e88 0x00000000 0x14b307b3 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000002 0x00000002 0x00000002 0x00020b00
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00005188 0x00000000 0x0e4b16ea 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x1176a3be 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000001 0x00000001
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000001 0x00020b00 0x000000dc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x0e4b1194 0x00000000 0x1176a3bc 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000b00
Jun 2 09:46:26 xen01 vgpu-9[13537]: error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000
Jun 2 09:46:26 xen01 vgpu-9[13537]: message repeated 2 times: [ error: vmiop_log: (0x0): TDR_DUMP:0x00000000 0x00000000 0x00000000 0x00000000]
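
The TDR_DUMP lines make up most of this output and can bury other messages from the vGPU process. A rough sketch for filtering them out of a dom0 log is below. It assumes the vGPU messages are tagged vgpu-N[pid] as in the excerpt above and that the log file is /var/log/daemon.log; the function name vgpu_log_summary is made up for illustration.

```shell
#!/bin/sh
# Sketch only: show messages from the per-VM vgpu-N processes, but drop the
# bulky TDR_DUMP lines and just report how many of them there were.
vgpu_log_summary() {
    logfile="$1"
    # Keep lines tagged " vgpu-<number>[" and hide the register dump lines.
    grep -E ' vgpu-[0-9]+\[' "$logfile" | grep -v 'TDR_DUMP'
    # grep -c prints 0 when nothing matches, so the count is always shown.
    echo "TDR_DUMP lines skipped: $(grep -c 'TDR_DUMP' "$logfile")"
}

# Run against the file given on the command line, if any.
if [ "$#" -gt 0 ]; then
    vgpu_log_summary "$1"
fi
```

Called with the path of the daemon log, this leaves only the driver information, license state, and error lines, which are easier to compare between a working and a failing VM start.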

Unfortunately I’m running out of ideas. Did you try to run a Win10 guest already to see if this would work? But I assume there must be something else going wrong with the hypervisor.

I guess it is Citrix Hypervisor as well. Going to create a ticket. Thank you for your help!

Hi @sschaber, Citrix is working on it. This is what the Nvidia driver on the host logs.
Anything you have seen before?

Jun 20 09:41:31 xen01 vgpu-2[2281]: __mapcache_fault: 6f7a
Jun 20 09:41:31 xen01 vgpu-2[2281]: demu_register_msi_pirq: Error mapping MSI-X entry: 0, Invalid argument
Jun 20 09:41:31 xen01 vgpu-2[2281]: error: vmiop_log: (0x0): failed to register msi pirq
Jun 20 09:41:31 xen01 vgpu-2[2281]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
Jun 20 09:41:31 xen01 vgpu-2[2281]: notice: vmiop_log: Driver Version: 512.78
Jun 20 09:41:31 xen01 vgpu-2[2281]: notice: vmiop_log: vGPU version: 0xd0001
Jun 20 09:41:31 xen01 vgpu-2[2281]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
Jun 20 09:41:31 xen01 qemu-dm-2[2316]: 2316@1655710891.463647:xen_platform_log xen platform: xen|ModuleAdd: FFFFF801511E0000 - FFFFF801511FAFFF [monitor.sys]
Jun 20 09:41:36 xen01 vgpu-2[2281]: error: vmiop_log: (0x0): Timeout occurred, reset initiated.
Jun 20 09:41:36 xen01 vgpu-2[2281]: error: vmiop_log: (0x0): TDR_DUMP:0x52445456 0x00910238 0x000001cc 0x00000001

Just wanted to share the solution. It was a setting in dom0 of Citrix Hypervisor.
/opt/xensource/libexec/xen-cmdline --set-xen x2apic_phys=true

x2apic was enabled in the BIOS all along, but not in the hypervisor (by default).
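
A slightly more defensive way to apply the same setting in dom0 is sketched below. It assumes xen-cmdline also accepts --get-xen and prints the option as key=value; the function name ensure_x2apic_phys and the XEN_CMDLINE override variable are made up for illustration. Xen reads its command line at boot, so the host needs a reboot before the change takes effect.

```shell
#!/bin/sh
# Sketch only: set x2apic_phys=true on the Xen command line unless it is
# already there. XEN_CMDLINE is a hypothetical override, useful for testing.
XEN_CMDLINE="${XEN_CMDLINE:-/opt/xensource/libexec/xen-cmdline}"

ensure_x2apic_phys() {
    # Assumption: --get-xen prints "x2apic_phys=<value>" or nothing.
    current="$("$XEN_CMDLINE" --get-xen x2apic_phys 2>/dev/null)"
    case "$current" in
        *x2apic_phys=true*)
            echo "x2apic_phys=true is already on the Xen command line"
            return 0
            ;;
    esac
    # Same command as posted in the thread.
    "$XEN_CMDLINE" --set-xen x2apic_phys=true || return 1
    echo "Set x2apic_phys=true; reboot the host so Xen picks it up"
}

if [ -x "$XEN_CMDLINE" ]; then
    ensure_x2apic_phys
else
    echo "xen-cmdline not found at $XEN_CMDLINE" >&2
fi
```

Run it on each host in the pool, since the Xen command line is a per-host setting.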
