CUDA_ERROR_UNKNOWN after upgrading RAM from 256GB to 512GB and Win7 -> Win10 (Pro)

I’ve been having trouble with CUDA on one of our machines for about a year, and have been forced to work in OpenCL when using Agisoft PhotoScan/Metashape. The troubles started when I upgraded from Win7 to Win10 and doubled the RAM to 512GB (Win7 only supports ~192GB). The platform is a Supermicro X10DRG-Q with dual Xeon E5-2643 v3 CPUs, 512GB RAM, and three GPUs: two EVGA 980 Ti hybrids (slots 1 and 2) and an EVGA Titan X hybrid running the display (slot 3).

Agisoft throws a “CUDA_ERROR_UNKNOWN (999) at line 128” error. Periodically I also get “Warning: cudastreamdestroy failed: all CUDA-capable devices are busy or unavailable (46)”.

Most recently (after doing a repair install/upgrade-in-place) I’ve been able to do alignment with only the display GPU disabled in Metashape, but I get the error on dense cloud regardless. This week I gave up on the old install after updating the BIOS and disabling IPMI on the motherboard, so I bit the bullet and did a clean reinstall (going from dual boot to EFI only), installed the latest NVidia drivers, and still got the same error.

I’m scratching my head here. I think it probably comes down to BIOS settings or something that changed with WDDM 2 in Win10 (pretty much all the similar errors I’ve seen posted are on Win10). I’ve run nvidia-smi and I see all the Windows crap running on the GPUs too, but I’m not sure how to stop it since I don’t have a non-CUDA GPU I can point those processes to. The mobo does have a built-in ASPEED VGA display adapter, and I haven’t jumper-disabled that, but deprioritizing it in the BIOS doesn’t seem to help. I have tried switching the active display to GPU 1 but I haven’t tried moving the cards around. Again, everything worked great in Win7.

I suppose I could give up on Windows and install a Linux OS to see if that fixes the problem, but I’d like to figure out what’s going on and I’m at a loss how to troubleshoot. Agisoft seems stumped too, and there are about a half-dozen people posting about this issue on their forums, so probably 5-10x that many are actually having the problem.

So now I’m reaching out to NVidia and devs (an NVidia customer support chat rep sent me here) and talking w/ Supermicro too. I’m wondering if I need to investigate BIOS settings like IOAT, snoop, relaxed ordering, etc., and/or if I just need the magic combination of drivers and registry settings in Windoze.

OS: Windows 10 Pro 1809 build 17763.253
mobo X10DRG-Q
GPU1 980 Ti (no monitor)
GPU2 980 Ti (no monitor)
GPU3 Titan X (display)
(I have tried switching the display cable to different cards, but have not tried switching the cards around or jumper-disabling the built-in video adapter. Remember, it all worked in Win7 with half the RAM.)

512GB RAM (all slots) at 18xx MHz I think…

related threads here:
https://www.agisoft.com/forum/index.php?topic=10394.0
https://www.agisoft.com/forum/index.php?topic=8946.0
https://www.agisoft.com/forum/index.php?topic=8445.0

(there are more but not with any new info)

Cleaned up my BIOS config and reinstalled the OS with all-EFI boot (was dual boot), CSM disabled, and have been stepping through MMIOH Base and MMIO High Size settings (not sure if there’s a way to ID the “right” settings).

Currently with MMIOH Base = 40TB and MMIO High Size = 1024GB I can run CUDA on the two 980 Ti cards (slot 1, slot 2, no display) but if I enable slot 3 (Titan X) I get CUDA error 999.
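As a sanity check on those two numbers, here is a minimal Python sketch of the address arithmetic. It rests on two assumptions that I have not confirmed for this board: that the BIOS values are binary units (TiB/GiB), and that the CPU decodes 46 physical address bits, which I believe is the case for Xeon E5 v3. The function names are made up for this example.

```python
TIB = 1 << 40
GIB = 1 << 30

# Assumption: the CPU decodes 46 physical address bits (64 TiB).
ASSUMED_PHYS_ADDR_BITS = 46


def mmio_window(base_tib, size_gib):
    """Return (start, end) byte addresses of the high MMIO window; end is exclusive."""
    start = base_tib * TIB
    return start, start + size_gib * GIB


def fits_physical_address_space(end, bits=ASSUMED_PHYS_ADDR_BITS):
    """True if the window ends at or below the assumed physical address limit."""
    return end <= (1 << bits)


def clears_installed_ram(start, ram_gib):
    """Rough check that the window starts above installed RAM.

    Ignores the small amount of RAM remapped above the 4 GiB hole.
    """
    return start >= ram_gib * GIB


if __name__ == "__main__":
    start, end = mmio_window(40, 1024)  # MMIOH Base = 40TB, MMIO High Size = 1024GB
    print(f"MMIOH window: {start:#x} .. {end - 1:#x}")
    print("fits assumed address space:", fits_physical_address_space(end))
    print("starts above 512 GiB of RAM:", clears_installed_ram(start, 512))
```

Under those assumptions the window runs from 40 TiB to 41 TiB, below the 64 TiB limit and well above the installed RAM. That only rules out a simple overflow or overlap; it says nothing about whether the BARs of all three cards actually get assigned inside the window.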

(This is just a quick test with Agisoft Metashape, align mode)

using these as guidance:

https://www.servethehome.com/nvidia-smi-issues-get-nvidia-cuda-working-grid-tesla-gpus/

https://nvidia.custhelp.com/app/answers/detail/a_id/4119/~/incorrect-bios-settings-on-a-server-when-used-with-a-hypervisor-can-cause-mmio

https://forums.guru3d.com/threads/functional-4-way-sli-pascal-titan-x.409468/page-10

So quiet in here…

I’ll update.

In touch with NVidia level 2 tech support (email). They submitted Bug #200507769 against CUDA Driver. Progress will be painfully slow, I imagine. I think the bug report is only registered against driver ver 393.97 (my last known good driver before upgrade). But I have also tried against 419.35.

Would appreciate any tips on running diags with nvidia-smi or anything really…
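For anyone comparing notes, this is roughly how I would capture the per-GPU state before and after a BIOS or driver change. It is only a sketch: it assumes nvidia-smi is on the PATH (on Windows it may live in the driver’s install folder instead), and the set of query fields is my own pick from what `nvidia-smi --help-query-gpu` lists. The helper names are made up for this example.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "name",
    "pci.bus_id",
    "driver_version",
    "driver_model.current",  # WDDM or TCC on Windows
    "display_active",
    "memory.used",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus():
    """Run nvidia-smi and return the parsed rows."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not run nvidia-smi:", exc)
```

Saving this output alongside each MMIOH setting would at least show whether all three cards enumerate, which one Windows treats as the active display, and which driver model each card is running under when the 999 error appears.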