Quad (4x) A6000 WSL2 CUDA Init Errors

Hey there,

I’m running into some issues with WSL2 on a 4x A6000 machine. I’ve tried both the CUDA 11.7 and 12.0 samples. After successfully building the nbody CUDA sample, running it fails with “Error: only 0 Devices available, 1 requested. Exiting.”

$ echo $PATH returns the following successful cuda path:
/usr/local/cuda-12.0/bin

$ echo $CUDA_HOME returns the following:
/usr/local/cuda-12.0/

$ ldconfig -p lists the following CUDA libraries as loaded:

Nvidia-SMI returns the following:

…what exactly am I missing from this seemingly appropriate WSL2 setup? I did not install Nvidia display drivers. I am on a brand-new WSL2 Ubuntu image running the latest kernel from Microsoft, on Windows 11 22H2.

Drivers as stated in Nvidia-SMI output.

Calling torch.cuda.is_available() in PyTorch returns the following errors:

  1. “UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error?”
  2. “Error 2: out of memory (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109)”

Any help here would be wildly appreciated. Thanks!

Update:

If I do a quick:
$ export CUDA_VISIBLE_DEVICES=3 (or 1, or 2)

everything works fine. Moving up to the final GPU with $ export CUDA_VISIBLE_DEVICES=4 causes a failure.

The same issue happens with WSL2 on my 4x A6000 machine, but for me any CUDA_VISIBLE_DEVICES value that includes GPU 1 causes failures, while 0, 2, and 3 work fine. Still looking for solutions. :(
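For anyone who wants to map out which device subsets initialize on their own machine, here is a minimal sketch that probes every combination in a fresh process. It assumes PyTorch is installed in the WSL2 environment; `PROBE`, `device_subsets`, `probe`, and `sweep` are hypothetical names made up for this example. Note that CUDA_VISIBLE_DEVICES indices are zero-based, so on a four-GPU machine the valid values are 0 through 3.

```python
import itertools
import os
import subprocess
import sys

# Assumption: PyTorch is installed. Any command that exits 0 only when
# CUDA initializes could be swapped in here instead.
PROBE = [
    sys.executable,
    "-c",
    "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)",
]


def device_subsets(num_gpus):
    """Yield every non-empty subset of zero-based device indices, smallest first."""
    for size in range(1, num_gpus + 1):
        yield from itertools.combinations(range(num_gpus), size)


def probe(devices, command=PROBE):
    """Run `command` in a fresh process that only sees `devices`.

    Returns True when the command exits with status 0.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, devices)))
    return subprocess.run(command, env=env, capture_output=True).returncode == 0


def sweep(num_gpus, command=PROBE):
    """Probe every subset and return {device_tuple: passed}."""
    return {devices: probe(devices, command) for devices in device_subsets(num_gpus)}
```

Running `for devices, ok in sweep(4).items(): print(devices, "pass" if ok else "fail")` would print one line per combination. Each probe runs in its own process because CUDA reads CUDA_VISIBLE_DEVICES once at initialization, so changing it inside an already-initialized process has no effect.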

Thanks, I finally got things working by limiting myself to 3 visible devices. I’m using 4x 2080 Tis and am facing the same problem.

Given that this is now mildly repro’ed by 3 people, does any Nvidia team member have thoughts on whether this is a Microsoft / WSL or an Nvidia challenge? Or correctable user error perhaps?

Any news on this? I’m seeing a similar issue with 4x A6000’s. Works fine with three. Nvidia-smi shows all four available, but no joy unless limited by CUDA_VISIBLE_DEVICES or physically unplugging the fourth GPU. This is with driver version 536.25 and CUDA 12.2.

Possible clue: running dmesg -w while attempting to start up a container with four GPUs produces a whole pile of these errors:

misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75
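A small sketch for pulling the failing ioctl name and code out of log lines like the one above. It assumes the trailing number is a negated Linux errno, which is the usual convention for kernel ioctl failures but is not confirmed for this driver; under that assumption, errno 75 on Linux is EOVERFLOW. `parse_dxg_failure` and `errno_name` are hypothetical helper names.

```python
import errno
import re

# Matches lines such as:
#   misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75
PATTERN = re.compile(r"dxgk: (\w+): Ioctl failed: (-?\d+)")


def parse_dxg_failure(line):
    """Return (ioctl_name, code) for a dxg ioctl failure line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def errno_name(code):
    """Map a (possibly negated) errno to its symbolic name on this platform.

    Assumption: the driver reports a negated Linux errno.
    """
    return errno.errorcode.get(abs(code), "unknown")


line = "misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75"
ioctl, code = parse_dxg_failure(line)
print(ioctl, code, errno_name(code))
```

Piping `dmesg` output through `parse_dxg_failure` line by line would show whether the failures are all the same ioctl or a mix.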

I resolved this by setting SLI Configuration to “Activate All Displays” in the Nvidia control panel. Given that nvlink/sli was not working before this change, this doesn’t seem to have any downside as far as I can tell.


Same issue as the others.
Running latest on WSL2, Ubuntu 22.04, CUDA 12.2. 4x RTX 6000 Ada.
Using export CUDA_VISIBLE_DEVICES=0,1,2,3
Any combination that includes both 1 and 3 fails:
0,1,2,3 fail
0,1,2 pass
0,1,3 fail
0,2,3 pass
1,2,3 fail
0,1 pass
0,2 pass
0,3 pass
1,2 pass
1,3 fail
2,3 pass
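The pattern in the list above can be checked mechanically. This is a minimal sketch using only the results reported in this post; `minimal_failing_sets` and `explains_all` are hypothetical helper names.

```python
# Results as reported above: visible-device set -> did CUDA initialize?
RESULTS = {
    (0, 1, 2, 3): False,
    (0, 1, 2): True,
    (0, 1, 3): False,
    (0, 2, 3): True,
    (1, 2, 3): False,
    (0, 1): True,
    (0, 2): True,
    (0, 3): True,
    (1, 2): True,
    (1, 3): False,
    (2, 3): True,
}


def minimal_failing_sets(results):
    """Return failing sets that have no failing proper subset among the tested sets."""
    failing = [set(devices) for devices, ok in results.items() if not ok]
    return sorted(
        tuple(sorted(f)) for f in failing if not any(other < f for other in failing)
    )


def explains_all(culprit, results):
    """True if a set fails exactly when it contains every device in `culprit`."""
    culprit = set(culprit)
    return all((culprit <= set(devices)) == (not ok) for devices, ok in results.items())


print(minimal_failing_sets(RESULTS))  # [(1, 3)]
print(explains_all((1, 3), RESULTS))  # True
```

So, for the combinations tested here, a set fails exactly when it contains both device 1 and device 3. Single-device runs were not reported, so they are not part of this check.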

See related posts here:

I’ve also found an odd partial workaround, which indicates that it is an initialization issue of some sort at the driver level. It doesn’t necessarily work reliably, but it allows you to get the system GPU configurations.

Toggling GPU performance counters to unrestricted, applying the setting, and then switching back allows the GPU check functions to complete.

These workarounds unfortunately did not work for me. I’ve resorted to just dual booting pure Ubuntu to be done with these shenanigans.


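Hello, I noticed that this problem was first reported as early as 2021 and still has not been resolved. I am now facing the same problem with 4x 2080 Ti. Using the latest driver, CUDA Toolkit, and cuDNN does not work, and trying older versions of the driver and CUDA did not solve it either. After this much time, I suspect it will not get official attention.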