Hi all,
I have been struggling with random system lockups (the display freezes and the pointer will not move) when trying to use dual GTX 1080s in this configuration: ASRock X99E-ITX + i7-6800K + 2x GTX 1080 connected to a single x16 PCIe slot via an x8/x8 splitter, using PCIe bifurcation.
Under Linux (4.8 kernel, 375.26 driver, CUDA 8), one way to reliably reproduce the issue is to run cuda/samples/1_Utilities/p2pBandwidthLatencyTest. With VT-d disabled it completes normally and prints the output below. With VT-d enabled it hangs right after printing the first matrix, and I have to power cycle the machine:
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 225.01 5.75
1 5.78 237.17
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 237.90 4.47
1 5.19 237.17
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 234.89 10.19
1 10.30 238.17
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 234.96 8.73
1 8.77 237.68
P2P=Disabled Latency Matrix (us)
D\D 0 1
0 2.90 12.95
1 12.41 2.51
P2P=Enabled Latency Matrix (us)
D\D 0 1
0 2.63 6.05
1 5.65 2.56
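For anyone who wants to compare their own runs against these numbers, here is a minimal Python sketch that parses output in the format above and pulls out the GPU-to-GPU (off-diagonal) entries. It is not part of the CUDA samples; the function names `parse_matrices` and `off_diagonal` are made up for illustration, and it assumes the plain-text layout shown above.

```python
# Sketch only: parse_matrices and off_diagonal are hypothetical helper names,
# and the parser assumes the plain-text matrix layout pasted above.

SAMPLE = r"""Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 225.01 5.75
1 5.78 237.17
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 237.90 4.47
1 5.19 237.17
"""


def parse_matrices(text):
    """Return {matrix title: list of rows of floats} from the test output."""
    matrices = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "Matrix" in line:
            # A title line starts a new matrix.
            title = line
            matrices[title] = []
        elif line.startswith("D\\D"):
            # Column header row, carries no data.
            continue
        elif title is not None:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                # First field is the row's device index, the rest are values.
                matrices[title].append([float(x) for x in fields[1:]])
            except ValueError:
                # Unrelated text: stop collecting rows for this matrix.
                title = None
    return matrices


def off_diagonal(matrix):
    """Return the GPU-to-GPU entries (row index != column index)."""
    return [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j
    ]


if __name__ == "__main__":
    for name, matrix in parse_matrices(SAMPLE).items():
        print(name, off_diagonal(matrix))
```

On the numbers above, this makes it easy to see that unidirectional GPU-to-GPU bandwidth is lower with P2P enabled (4.47 and 5.19 GB/s) than with it disabled (5.75 and 5.78 GB/s), while GPU-to-GPU latency roughly halves.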
Needless to say, I have tried every combination of OS, UEFI and driver versions to no avail. Given that the system works solidly without IO virtualization, where should I start looking? Has anyone come across this before? I would like to rule out the NVIDIA driver before I blame a buggy UEFI or a Linux kernel issue.
As an aside, on the exact same system I am completely unable to use Win10 with both cards enabled (any version, any driver, any VT-d setting). I managed to get the ASRock UEFI team in Taiwan to reproduce this. They noticed that although dual 1080s froze in Win10 as described above, dual 980s worked just fine, so they blamed the drivers. I filed a regular technical support request with NVIDIA, but after a lot of back and forth the answer was: sorry, we don’t support custom risers…
So, long story short: under Linux everything works as long as I stay away from IO virtualization, and under Win10 it is a total no-go. Maybe the VT-d thing is a red herring, but it feels like some sort of IRQ mix-up is going on. I am hoping that if the Linux problem can be isolated with the help of the community, it will also prove useful for diagnosing the Win10 issue…
If anyone is willing to help, I can of course provide much more technical detail: kernel dmesg output, things I’ve tried, etc.
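One thing that might be worth comparing with VT-d enabled is how the two cards are grouped by the IOMMU, since both sit behind the same bifurcated slot. Below is a rough Python sketch, assuming the usual Linux sysfs layout under /sys/kernel/iommu_groups. The helper names and the PCI addresses in the example are hypothetical placeholders, not values from my system, and this is only a starting point rather than a diagnosis.

```python
# Sketch only: assumes the Linux sysfs layout
#   /sys/kernel/iommu_groups/<group>/devices/<pci address>
# iommu_groups and group_of are hypothetical helper names.
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # Directory missing: IOMMU support absent or the layout differs.
        return groups
    for group in base.iterdir():
        devices = group / "devices"
        if group.name.isdigit() and devices.is_dir():
            groups[int(group.name)] = sorted(d.name for d in devices.iterdir())
    return groups


def group_of(groups, pci_address):
    """Return the group number containing pci_address, or None if not found."""
    for number, devices in groups.items():
        if pci_address in devices:
            return number
    return None


if __name__ == "__main__":
    groups = iommu_groups()
    # Placeholder addresses: replace with the GPU addresses reported by lspci.
    for gpu in ("0000:01:00.0", "0000:02:00.0"):
        print(gpu, "-> IOMMU group", group_of(groups, gpu))
```

If no groups are listed at all, the IOMMU is probably not active for that boot, which is what I would expect with VT-d disabled in the UEFI.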
Thanks