Weird display driver lockup with 2x GTX 1080 and Intel VT-d enabled

Hi all,

I have been struggling with random system lockups (the display freezes and the pointer will not move) when trying to use dual GTX 1080s in this configuration: ASRock X99E-ITX + i7-6800K + 2x GTX 1080 connected to a single x16 PCIe slot via an x8/x8 splitter, using PCIe bifurcation.

Under Linux (4.8 kernel, 375.26 drivers, CUDA 8), one way to reliably reproduce the issue is to run cuda/samples/1_Utilities/p2pBandwidthLatencyTest. With VT-d disabled it completes normally and prints the output below. With VT-d enabled it hangs right after printing the first matrix, and I have to power cycle the machine:

P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 225.01 5.75
1 5.78 237.17
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 237.90 4.47
1 5.19 237.17
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 234.89 10.19
1 10.30 238.17
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 234.96 8.73
1 8.77 237.68
P2P=Disabled Latency Matrix (us)
D\D 0 1
0 2.90 12.95
1 12.41 2.51
P2P=Enabled Latency Matrix (us)
D\D 0 1
0 2.63 6.05
1 5.65 2.56

Needless to say, I have tried every combination of OS, UEFI and driver versions to no avail. Given that the system works solidly without IO virtualization, where should I start looking, in case anyone has come across this before? I would like to make sure this is not NVIDIA driver related in any way before I blame it on a buggy UEFI or on other Linux kernel related issues.
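One cheap place to start, sketched here under assumptions: with VT-d enabled the kernel sorts PCI devices into IOMMU groups and exposes them under /sys/kernel/iommu_groups. The Python below lists those groups and reports which group a given PCI address sits in, which shows whether the two GPUs behind the bifurcated slot share a group. The function names are hypothetical, and the GPUs' PCI addresses are not given in this thread, so they would have to be taken from lspci. Sharing or not sharing a group does not by itself explain the hang.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [PCI addresses]} read from sysfs.

    An empty dict means the directory is missing or has no groups,
    which is what one would expect with the IOMMU (VT-d) disabled.
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


def groups_containing(groups, pci_addresses):
    """Map each requested PCI address to its group number (None if absent)."""
    lookup = {dev: grp for grp, devs in groups.items() for dev in devs}
    return {addr: lookup.get(addr) for addr in pci_addresses}
```

Calling groups_containing(list_iommu_groups(), [...]) with the two GPU addresses from lspci -D gives the group of each card.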

As an aside, with the exact same system, I am completely unable to use Win10 with 2 cards enabled (any version, any drivers, any setting for VT-d). I managed to get the ASRock UEFI guys in Taiwan to reproduce this, and they noticed that although dual 1080s froze in Win10 as described above, dual 980s worked just fine, so they blamed the drivers. I filed a regular technical support request with NVIDIA, but after a lot of back and forth the answer was: sorry, we don’t support custom risers.

So long story short: Under Linux everything works as long as I stay away from IO virt, and under Win10 total no go. Maybe the vt-d thing is a red herring, but it feels like some sort of IRQ mixup going on, and I am hoping that if the Linux problem is isolated with the help of the community then maybe it will prove useful for diagnosing the issue with Win10 too …
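To put the IRQ suspicion on firmer ground, the contents of /proc/interrupts can be compared between a VT-d enabled and a VT-d disabled boot. The sketch below is only an illustration: it parses the text of that file and returns the rows whose description mentions a driver name, flagging rows where several handlers share one IRQ. The function name is hypothetical, and matching on the string "nvidia" assumes that is how the driver's handler appears on this system.

```python
def irq_rows_for(text, needle="nvidia"):
    """Return (irq_label, description, is_shared) for rows mentioning needle.

    `text` is the full contents of /proc/interrupts. The header row holds
    one column name per CPU; each later row is "<label>: <counts...> <desc>".
    """
    rows = []
    lines = text.splitlines()
    if not lines:
        return rows
    n_cpus = len(lines[0].split())  # header: CPU0 CPU1 ...
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Everything after the per-CPU counters describes the IRQ.
        description = " ".join(fields[n_cpus:])
        if needle in description.lower():
            # Handlers sharing an IRQ are listed separated by commas.
            rows.append((label.strip(), description, "," in description))
    return rows
```

Running it on open("/proc/interrupts").read() in both boots and comparing the two lists shows whether the GPUs move between MSI and shared legacy interrupts.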

If anyone is willing to help, I can of course provide much more technical detail: kernel dmesg output, things I’ve tried, etc.

Thanks

It’s possible your issue is with the riser, or the motherboard not supporting the specific configuration you are attempting to do with the riser. I know it might not be the answer that you’re looking for, but for such a unique configuration, it might be better to instead go with a different motherboard that provides 2 slots where a riser is not needed.

I am unable to bring the bits and pieces (riser card, Win 10, VT-D, 1080 vs 980) into a coherent mental model that would explain your observations.

Use of a riser card certainly raises suspicions, not only in terms of handling the signal logic (e.g. the IRQs you mentioned), but also from a purely electrical signal integrity standpoint. Any intermediate socket / slot will add electrical load, and there may also be mechanical strain on the connectors depending on how the cards are installed.

The answer you received from NVIDIA simply reflects a normal bug reporting process, the first step of which is reproducing a reported issue in-house. If you have a very exotic setup that cannot be replicated in-house, it is possible that the bug report cannot be addressed in a meaningful way, and it appears this happened here.

Like vacaloca, I would suggest using a more standard hardware setup.

Thanks to both for coming back to this. Your points about the non-standard setup are well taken, and ultimately that may indeed be the only pragmatic solution. One thing I would add: regarding electrical noise and mechanical strain, I have stress-tested the setup (with VT-d disabled) for more than 24 hours by running Caffe and Folding@home in parallel on both GPUs at 100% utilization without any issues at all. That is why I am much more inclined to blame either misbehaving drivers or, more likely, a misbehaving motherboard + UEFI combination. I am following up with ASRock in any case, and should anything come out of it, I will drop a line here to keep everyone posted.