Windows 10 WSL Ubuntu 20.04: Fortran MPI+OpenACC+DC GPU code not running


I am testing out Windows 10 WSL Ubuntu 20.04.

I am able to install the CUDA toolkit and run sample code per the instructions at

and I am able to run nvidia-smi:

| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   47C    P0    45W / 200W |    404MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |

I am also able to install the NVIDIA HPC SDK compiler.

However, when I try to compile my Fortran MPI+OpenACC+DC code, I get:

mpif90 -O3 -march=native -acc=gpu -stdpar=gpu -gpu=cc86,cc61,nomanaged -Minfo=accel -c pchip_module_v1.0.0.f90 -o pchip_module.o
nvfortran-Error-CUDA 11.1 or later required
nvfortran-Error-A CUDA toolkit matching the current driver version (0) or a supported older version (11.0) was not installed with this HPC SDK.
make: *** [Makefile:62: pchip_module.o] Error 1

Since the Linux driver is not installed (WSL uses the Windows driver), it seems NVHPC is detecting driver version “0”.
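One quick way to confirm whether the driver library is visible to the runtime is to look for libcuda.so directly (a sketch; /usr/lib/wsl/lib is the usual location in a standard WSL2 setup, and the fallback messages are just illustrative):

```shell
# Check whether the WSL2 driver-backed libcuda.so is present where WSL normally puts it.
ls /usr/lib/wsl/lib/libcuda.so* 2>/dev/null || echo "libcuda.so not found in /usr/lib/wsl/lib"

# Check whether the dynamic linker can see it at all (it usually cannot on WSL
# unless LD_LIBRARY_PATH or ldconfig has been set up to include that directory).
ldconfig -p | grep libcuda || echo "libcuda.so not in the linker cache"
```

If the first command finds the library but the second does not, the driver is installed and only the lookup path is the problem.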

However, if I explicitly tell the compiler to use CUDA 12.0, the compilation DOES work:

mpif90 -O3 -march=native -acc=gpu -stdpar=gpu -gpu=cc86,cuda12.0,nomanaged -Minfo=accel -I/opt/psi/nv/ext_deps/deps/hdf4/include -I/opt/psi/nv/ext_deps/deps/hdf5/include -c mas_sed_expmac.f -o mas.o
... ... ...  
mpif90 -O3 -march=native -acc=gpu -stdpar=gpu -gpu=cc86,cuda12.0,nomanaged -Minfo=accel -I/opt/psi/nv/ext_deps/deps/hdf4/include -I/opt/psi/nv/ext_deps/deps/hdf5/include pchip_module.o mas.o -L/opt/psi/nv/ext_deps/deps/hdf4/lib -lmfhdf -ldf  -L/opt/psi/nv/ext_deps/deps/hdf5/lib -lhdf5_fortran -lhdf5hl_fortran -lhdf5 -lhdf5_hl -L/opt/psi/nv/ext_deps/deps/jpeg/lib -ljpeg -L/opt/psi/nv/ext_deps/deps/zlib/lib -lz -o mas
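For reference, the workaround can be kept in one place by pinning the CUDA version in a flags variable, so every compile line picks it up (a sketch reusing the flags and file names from above; the variable name is arbitrary):

```shell
# -gpu=cuda12.0 pins the CUDA toolkit version so the compiler does not probe
# the (absent) Linux driver under WSL; the other flags are unchanged from above.
FFLAGS="-O3 -march=native -acc=gpu -stdpar=gpu -gpu=cc86,cuda12.0,nomanaged -Minfo=accel"
mpif90 $FFLAGS -c pchip_module_v1.0.0.f90 -o pchip_module.o
```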

But when I try to run my code, it fails at the first use of a GPU kernel:

 mpiexec -np 1 ../../../../branches/mas_acc/mas mas
...  ... ...
Current file:     /opt/psi/nv/mas/branches/mas_acc/mas_sed_expmac.f
        function: zero_avec
        line:     25732
This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[37099,1],0]
  Exit code:    1

Is there a way to make the NVHPC compile GPU codes that work on WSL?

– Ron

Hi Ron,

It’s likely that our runtime can’t find the CUDA driver. Try setting LD_LIBRARY_PATH in your environment to include “/usr/lib/wsl/lib”, or wherever it was installed.

Note: you can tell if this is the problem by running the ‘nvaccelinfo’ utility rather than nvidia-smi, since it uses the same look-up as our runtime.

Example with my laptop Ubuntu on WSL2:

~$ nvaccelinfo -v
not found
No accelerators found.
Check that you have installed the CUDA driver properly
Check that your LD_LIBRARY_PATH environment variable points to the CUDA runtime installation directory
~$ export LD_LIBRARY_PATH=/usr/lib/wsl/lib
~$ nvaccelinfo

CUDA Driver Version:           11070

Device Number:                 0
Device Name:                   NVIDIA T600 Laptop GPU
Device Revision Number:        7.5
Global Memory Size:            4294705152
Number of Multiprocessors:     14
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1395 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5001 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 1048576 bytes
Max Threads Per SMP:           1024
Async Engines:                 6
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     No
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
Default Target:                cc75
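To make the fix stick across shells, the export can be persisted (a sketch assuming bash and the standard WSL2 path; adjust if your install differs):

```shell
# Prepend the WSL driver library path so the NVHPC runtime can find libcuda.so;
# the ${VAR:+...} expansion avoids a trailing colon when LD_LIBRARY_PATH is unset.
export LD_LIBRARY_PATH=/usr/lib/wsl/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# To apply it in every new shell, append the same line to ~/.bashrc, e.g.:
# echo 'export LD_LIBRARY_PATH=/usr/lib/wsl/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' >> ~/.bashrc
```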


This worked perfectly!


– Ron
