Same issue as the others.
Running latest on WSL2, Ubuntu 22.04, CUDA 12.2. 4x RTX 6000 Ada.
Using `export CUDA_VISIBLE_DEVICES=0,1,2,3`
Any combination that includes both GPU 1 and GPU 3 fails.
```
0,1,2,3  fail
0,1,2    pass
0,1,3    fail
0,2,3    pass
1,2,3    fail
0,1      pass
0,2      pass
0,3      pass
1,2      pass
1,3      fail
2,3      pass
```
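The pattern above is easy to state in code. A small helper (hypothetical, just to summarize the observations) that reproduces the table — a combination fails exactly when it contains both GPU 1 and GPU 3:

```python
from itertools import combinations

def combo_fails(devices):
    """Predict the observed outcome for a CUDA_VISIBLE_DEVICES combination:
    every failing combination in the table contains both GPU 1 and GPU 3."""
    return {1, 3} <= set(devices)

# Regenerate the table for all 2- to 4-GPU combinations of devices 0-3.
for r in (4, 3, 2):
    for combo in combinations(range(4), r):
        print(",".join(map(str, combo)), "fail" if combo_fails(combo) else "pass")
```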
See related posts here:
**Related issue (opened 07 Jul 2023):**
### Windows Version
Windows11
### WSL Version
1.2.5.0
### Are you using WSL 1 or WSL 2?
- [X] WSL 2
- [ ] WSL 1
### Kernel Version
5.15.90.1
### Distro Version
Ubuntu (default)
### Other Software
I am writing to report an unexpected behavior I’ve encountered when working with PyTorch and CUDA on a WSL2 system on Windows 11 equipped with multiple NVIDIA RTX 3090 GPUs.
Environment Details:
Operating System: Windows 11
CUDA Version: 12.2
WSL Version: 2
GPUs: 4x NVIDIA RTX 3090
PyTorch Version: 2.0.1 (CUDA 11.8)
Problem Statement: When I set the `CUDA_VISIBLE_DEVICES` environment variable to enable all the GPUs (0,1,2,3) and then run a PyTorch script that calls `torch.cuda.is_available()`, I encounter an “Out of Memory” error. Notably, the error does not occur if I enable only GPU 1, or the combination 0,2,3. Furthermore, the error can be circumvented by calling `torch.cuda.device_count()` before `torch.cuda.is_available()`.
### Repro Steps
Steps to Reproduce:
Set the environment variable: `export CUDA_VISIBLE_DEVICES=0,1,2,3`
Run a Python script that imports PyTorch and calls `torch.cuda.is_available()`
### Expected Behavior
Expected Behavior: `torch.cuda.is_available()` should return True when GPUs are available and accessible.
### Actual Behavior
Observed Behavior: An “Out of Memory” error is triggered internally at `../c10/cuda/CUDAFunctions.cpp:109`, and `torch.cuda.is_available()` returns False.
Workaround: Calling `torch.cuda.device_count()` before `torch.cuda.is_available()` circumvents the error, but it requires modifying each script to add the extra call.
While the workaround is effective, it may be beneficial to investigate and address the root cause of this issue. I wanted to bring this to your attention and look forward to any insights or potential solutions you might provide.
### Diagnostic Logs
_No response_
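The workaround from the issue above is easy to package as a drop-in wrapper. A sketch (the function name `safe_cuda_is_available` is mine, not from the issue; it also degrades gracefully when PyTorch is not installed, so the snippet runs anywhere):

```python
def safe_cuda_is_available():
    """Apply the reported workaround: call torch.cuda.device_count()
    before torch.cuda.is_available() to sidestep the spurious
    out-of-memory error seen on multi-GPU WSL2 setups.
    Returns False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    # device_count() first: per the report, this forces full device
    # enumeration and avoids the failing init path in is_available().
    torch.cuda.device_count()
    return torch.cuda.is_available()

print("CUDA available:", safe_cuda_is_available())
```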
**Related issue (opened 21 Dec 2021):**
### Version
Microsoft Windows [Version 10.0.22000.376]
### WSL Version
- [X] WSL 2
- [ ] WSL 1
### Kernel Version
5.10.60.1
### Distro Version
Ubuntu 20.04 and Ubuntu 18.04
### Other Software
CPU: Intel(R) Core(TM) i9-9900X
GPU: 4x NVIDIA Titan RTX (driver 510.06)
RAM: 128GB
### Repro Steps
Install CUDA on WSL
```
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo sh -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update
sudo apt-get install -y cuda-toolkit-11-0
```
Run samples
```
cd /usr/local/cuda-11.0/samples/4_Finance/BlackScholes
sudo make
./BlackScholes
```
```
cd /usr/local/cuda-11.0/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
```
### Expected Behavior
Both samples run successfully.
### Actual Behavior
```
[./BlackScholes] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:777 code=2(cudaErrorMemoryAllocation) "cudaGetDeviceCount(&device_count)"
```
```
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 2
-> out of memory
Result = FAIL
```
### Diagnostic Logs
_No response_
I’ve also found an odd partial workaround, which suggests this is some kind of driver-level initialization issue. It doesn’t work reliably, but it can let you query the system’s GPU configuration: toggling GPU performance counter access to unrestricted, applying the setting, and then switching it back allows the GPU enumeration calls (e.g. `cudaGetDeviceCount`) to complete.