I’ve been having trouble with CUDA on one of our machines for about a year, and have been forced to work in OpenCL in Agisoft PhotoScan/Metashape. The trouble started when I upgraded from Win7 to Win10 and doubled the RAM to 512GB (Win7 only supports ~192GB). The platform is a Supermicro X10DRG-Q with dual Xeon E5-2643 v3 and 512GB RAM, plus three GPUs: two EVGA 980 Ti Hybrids (slots 1 and 2) and an EVGA Titan X Hybrid running the display (slot 3).
Agisoft throws a “CUDA_ERROR_UNKNOWN (999) at line 128” error. I also periodically get “Warning: cudastreamdestroy failed: all CUDA-capable devices are busy or unavailable (46)”.
Most recently (after doing a repair install/upgrade-in-place), I’ve been able to run alignment with just the display GPU disabled in Metashape, but dense cloud generation fails with the error no matter which GPUs are enabled. This week I updated the BIOS and disabled IPMI on the motherboard. When that didn’t help, I gave up on the old install, bit the bullet, and did a clean reinstall (going from dual boot to EFI only). I then installed the latest NVIDIA drivers and still got the same error.
I’m scratching my head here. I think it probably comes down to BIOS settings or something that changed with WDDM 2 in Win10 (pretty much all the similar errors I’ve seen posted are on Win10). I’ve run nvidia-smi and I see all the Windows crap running on the GPUs too. I’m not sure how to stop that, since I don’t have a non-CUDA GPU I can point it to. The mobo does have a built-in ASPEED VGA display adapter, and I haven’t jumper-disabled that, but deprioritizing it in the BIOS doesn’t seem to help. I have tried switching the active display to GPU 1, but I haven’t tried moving the cards around. Again, everything worked great in Win7.
I suppose I could give up on Windows and install Linux to see if that fixes the problem, but I’d like to figure out what’s going on, and I’m at a loss for how to troubleshoot it. Agisoft seems stumped too. About half a dozen people have posted about this issue on their forums, so probably 5-10x that many actually have the problem.
So now I’m reaching out to NVIDIA and the developers here (an NVIDIA customer support chat rep sent me here), and I’m talking with Supermicro too. I’m wondering whether I need to investigate BIOS settings like I/OAT, snoop mode, relaxed ordering, etc., or whether I just need the magic combination of drivers and registry settings in Windoze.
OS: Windows 10 Pro 1809 build 17763.253
GPU1: 980 Ti (no monitor)
GPU2: 980 Ti (no monitor)
GPU3: Titan X (display)
(I have tried moving the display cable to different cards. I have not tried rearranging the cards or jumper-disabling the built-in video adapter. Remember, it all worked in Win7 with half the RAM.)
RAM: 512GB (all slots populated) at 18xx MHz, I think…
(there are more but not with any new info)