Now Available: CUDA on WSL Public Preview

As you get started with CUDA on WSL, please review the documentation and resources available at the bottom of our CUDA on WSL page: https://nvda.ws/3hsWiMt


I’ve set everything up as documented and built the CUDA samples for CUDA 11 in the WSL container. deviceQuery works correctly, but all the other samples simply freeze when I run them.

I installed CUDA using the .deb files provided for the CUDA 11 RC on the NVIDIA website.

Furthermore, when I try to run the Docker containers I get:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

This is using the WSL Docker integration from Docker 2.3.1.0.

deviceQuery output:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2060 SUPER"
  CUDA Driver Version / Runtime Version          11.1 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (34) Multiprocessors, ( 64) CUDA Cores/MP:     2176 CUDA Cores
  GPU Max Clock rate:                            1695 MHz (1.70 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

Which samples in particular did you try?

For the Docker error: did you install libnvidia-container 1.2.0-rc.1? (You can check with “sudo apt-cache policy libnvidia-container1”.) Also, did you stop and restart the Docker service as indicated in the user guide?
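The two checks above can be sketched as a short shell sequence (a sketch, assuming the Ubuntu package names from the user guide; these commands need a system with apt and the Docker service present). The daemon restart is the step most often missed:

```shell
# Check which version of the container library is installed;
# the WSL preview requires the 1.2.0-rc.1 experimental build.
sudo apt-cache policy libnvidia-container1

# Restart the Docker daemon so it picks up the NVIDIA runtime.
sudo service docker stop
sudo service docker start
```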

On top of P_Ramarao’s question, it would also be helpful if you could run the dxdiag tool on your Windows 10 PC, save all the information there into a file (“Save All Information…” button), and attach the text file to this thread.

System Information

Dxdiag output:
DxDiag.txt (92.6 KB)

WSL version

PS C:\Users\psnape.PATRICK-WIN-PC> wsl cat /proc/version
Linux version 4.19.121-microsoft-WSL2-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Thu May 14 20:25:24 UTC 2020

Ubuntu is 18.04.

CUDA was installed from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=debnetwork

Docker

For Docker I was using the Docker integration from Docker on Windows, and then installed the NVIDIA container runtime following the instructions from the WSL user guide:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
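As a side note, the $distribution variable in the first line simply concatenates the ID and VERSION_ID fields of /etc/os-release, so on Ubuntu 18.04 it expands to “ubuntu18.04”. A minimal sketch against a stand-in os-release file (the /tmp path and file contents are only for illustration):

```shell
# Write a stand-in os-release file with the two fields the snippet reads.
cat > /tmp/os-release-example <<'EOF'
ID=ubuntu
VERSION_ID="18.04"
EOF

# Same expansion as in the install steps, sourced from the stand-in file.
distribution=$(. /tmp/os-release-example; echo $ID$VERSION_ID)
echo "$distribution"   # → ubuntu18.04
```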

CUDA Samples

For the CUDA samples I cloned the samples from https://github.com/NVIDIA/cuda-samples and built them from master (no errors).

I then successfully ran deviceQuery, but all the other samples I tried just hang (they never progress past the output given below, even after 5 minutes of waiting for each):

(base) ~/cuda-samples/bin/x86_64/linux/release [master]$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
(base) ~/cuda-samples/bin/x86_64/linux/release [master]$ ./MersenneTwisterGP11213
./MersenneTwisterGP11213 Starting...

GPU Device 0: "Turing" with compute capability 7.5

Allocating data for 2400000 samples...
Seeding with 777 ...
(base) ~/cuda-samples/bin/x86_64/linux/release [master]$ ./simpleCUBLAS
GPU Device 0: "Turing" with compute capability 7.5

simpleCUBLAS test running..

Also, running cuda-gdb on the matrixMul sample shows it hanging in the API initialisation:

(base) ~/cuda-samples/bin/x86_64/linux/release [master *]$ cuda-gdb matrixMul
NVIDIA (R) CUDA Debugger
9.1 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from matrixMul...done.
(cuda-gdb) run
Starting program: /home/psnape/cuda-samples/bin/x86_64/linux/release/matrixMul
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
[New Thread 0x7fffef8b0700 (LWP 1933)]
[New Thread 0x7fffef0af700 (LWP 1934)]
^C
Thread 1 "matrixMul" received signal SIGINT, Interrupt.
0x00007ffff5db3860 in cudbgApiInit () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
(cuda-gdb) bt
#0  0x00007ffff5db3860 in cudbgApiInit () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#1  0x00007ffff5db3ce3 in cudbgApiInit () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#2  0x00007ffff5ccc938 in cudbgGetAPI () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#3  0x00007ffff5b3bbf7 in ?? () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#4  0x00007ffff5bb34a7 in cuEGLApiInit () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#5  0x00007ffff5b7c72f in ?? () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#6  0x00007ffff5c3d1cc in cuDevicePrimaryCtxRetain () from /usr/lib/wsl/drivers/nv_dispi.inf_amd64_46ae561a4ce7ec27/libcuda.so.1.1
#7  0x000055555557edd2 in cudart::contextStateManager::initPrimaryContext(cudart::device*) ()
#8  0x000055555557ef97 in cudart::contextStateManager::initDriverContext() ()
#9  0x0000555555582918 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#10 0x000055555557487c in cudart::doLazyInitContextState() ()
#11 0x0000555555561c88 in cudart::cudaApiMalloc(void**, unsigned long) ()
#12 0x0000555555591433 in cudaMalloc ()
#13 0x000055555555b280 in MatrixMultiply (argc=1, argv=0x7fffffffe028, block_size=32, dimsA=..., dimsB=...) at matrixMul.cu:168
#14 0x000055555555bd68 in main (argc=1, argv=0x7fffffffe028) at matrixMul.cu:345
(cuda-gdb)

Got everything working (CUDA samples, cuda-gdb, Docker examples…) by doing this:

sudo apt-get --purge remove nvidia-driver-450
sudo apt-get install nvidia-driver-450

I don’t get it.

The docs for the installation say not to install the driver:

Download the NVIDIA Driver from the download section on the CUDA on WSL page. Choose the appropriate driver depending on the type of NVIDIA GPU in your system - GeForce and Quadro.

Install the driver using the executable. This is the only driver you need to install.

Note:
Do not install any Linux display driver in WSL. The Windows Display Driver will install both the regular driver components for native Windows and for WSL support.

Exactly, that’s why I’m confused. I did it because you can always reinstall the distro altogether…

That doesn’t seem to help me, unfortunately :( All the samples still hang.

Ok I’m just going to leave this here:
DxDiag.txt (100.5 KB)

I guess the biggest difference is my Intel processor

Thank you for posting the debug details and dxdiag. At this point I believe you are running into the issue we recently documented in the User Guide and which is currently under investigation: Known Limitations with CUDA on WSL 2.

Correct. You do NOT need to install any version of the driver within the WSL2 container. You only have to have the proper (display) driver installed on the Windows host.

Thank you for posting the dxdiag. I believe CUDA should work fine in WSL2 on your system.
Was there any problem you noticed when you followed the installation steps from the User Guide? Did any workload fail, leading you to try reinstalling the Linux driver (which was not the right step, btw)?

No problem, and btw, going through the cuda-samples I can see which ones are failing due to the known limitations.

While following the User Guide I made the mistake of starting with WSL version 1, and midway I made the switch to WSL 2 using the wsl --set-version command. I was mainly testing the Docker examples, and I was having trouble keeping the daemon up.
Sorry, I can’t really point out a specific issue; maybe the version switch caused some corner case. This was my first time using Docker too!

The GPU will not be present in the WSL container with WSL 1. The container must be switched to WSL 2 mode in order for CUDA to work there.
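For reference, the conversion is done from the Windows side, not inside the distro; a sketch, assuming the distro is named “Ubuntu-18.04” (check the exact name with the list command first; these commands run on the Windows host, not in a Linux shell):

```shell
# Run in PowerShell or cmd on the Windows host, not inside WSL.
# List installed distros and their current WSL version:
wsl --list --verbose

# Convert the distro to WSL 2 (required for GPU/CUDA support):
wsl --set-version Ubuntu-18.04 2

# Re-check that VERSION now shows 2:
wsl --list --verbose
```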

We believe we have root-caused the issue and have a fix that is going through internal testing. We hope to publish a patched driver after it passes the required validation.
Stay tuned and thanks again for reporting the issue in this forum.


Fantastic - thanks for the quick turnaround - really excited to try this out. Let me know if there’s any third-party testing I can help with.

We have just released and published an updated driver with a fix that we think may address your issue. Please give it a try when you have a chance: https://developer.nvidia.com/cuda/wsl/download. It would be much appreciated if you could confirm whether the issue is fixed on your side or not. And thanks again for helping us make CUDA work great in WSL 2!

(base) ~/cuda-samples/bin/x86_64/linux/release [master *]$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Turing" with compute capability 7.5

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 75.16 GFlop/s, Time= 1.744 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Fantastic! Now let me try Tensorflow in the docker container…