Hi, I have been developing Fortran codes to solve hydrodynamic problems (potential flow around floating objects) for a number of years. For the last few years I have been using the Windows version of the PGI 19.10 compiler to run the codes on CPU + GPU systems. The codes mostly run on a GTX 1080 Ti. I have been very pleased with the performance of the codes on the GPU and have received a lot of help from PGI and NVIDIA in getting the codes compiled.
Recently, I updated my desktop and had an RTX 4090 fitted. It seems that I now have a problem, because I am not able to run my existing codes on the RTX 4090. On checking for the availability of a suitable GPU, the programs immediately switch to the CPU on start-up and will not use the 4090.
I have not been able to find any information on why this is so or what the solution is. I know that NVIDIA has, up to now, not updated the Windows version of the compilers. I have invested a huge amount of time in developing these codes and fear that I have made the investment in the new GPU for nothing. Which path is now open to me?
jo_rotter
Hi Jo_rotter,
You might try compiling with "-tp=nollvm" so a PTX version of the device code is generated, which in turn gets JIT-compiled on the 4090. I haven't tried this myself so I can't be sure, but it's worth a try.
The second option is to move to Linux and install the latest version of the NVHPC compilers. While we don't officially support it, I've seen several users have success with WSL2, which lets you run Linux directly on your Windows system.
See the following guide on setting up WSL2 with CUDA support: NVIDIA GPU Accelerated Computing on WSL 2
Note: when running your code under WSL2, be sure to set LD_LIBRARY_PATH in your environment to include "/usr/lib/wsl/lib", or wherever libcuda.so was installed. Otherwise the runtime can't find the CUDA driver.
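A minimal sketch of that setup, assuming the default WSL2 driver location (adjust CUDA_LIB_DIR if your libcuda.so lives elsewhere):

```shell
# Assumption: /usr/lib/wsl/lib is where WSL2 exposes the Windows CUDA driver.
CUDA_LIB_DIR=/usr/lib/wsl/lib
# Prepend it to LD_LIBRARY_PATH, preserving any existing value.
export LD_LIBRARY_PATH="$CUDA_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
# ls "$CUDA_LIB_DIR"/libcuda.so*   # uncomment to confirm the driver is there
```

Note that this only affects the current shell session; it must be repeated (or persisted) for new sessions.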
-Mat
Hi Mat,
We followed up on your suggestion to use WSL2 on Windows 11. CUDA 12.1 was already installed, and Ubuntu was installed through WSL2. Following that, the latest compiler, 23.3, was installed. Using the guide "NVIDIA GPU Accelerated Computing on WSL 2", the NVIDIA Linux GPU driver was downloaded and installed following the steps on the CUDA download page for WSL-Ubuntu. Finally, LD_LIBRARY_PATH was set to include "/usr/lib/wsl/lib". We checked that libcuda.so was installed at that location.
We took all the files of the code and successfully compiled and linked the program.
We used the following "build-delfrac.sh" file (leaving out the 70-odd *.for files which make up the program):
#!/bin/bash
O_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/math_libs/12.0/targets/x86_64-linux/lib/stubs
LAPACK_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/lib
cd obj
nvfortran -c -cuda -Minfo=ftn -cudalib=cusolver -fortranlibs -acc=gpu -gpu=ccnative -target=gpu -cudaforlibs …/src/(70 files)
nvfortran -c -g -Mbackslash -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Minform=warn -Mvect=levels:8 -Mlarge_arrays -Mcudalib=cusolver …/src/(70 files)
echo "Compiling…"
nvfortran -c -Mextend -traceback -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Mcudalib=cusolver …/src/(70 files)
nvfortran -c -gopt -C -traceback -Mbounds -Mchkptr -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Mcudalib=cusolver …/src/(70 files)
echo "Linking…"
nvfortran -acc -Mcuda -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mvect=levels:8 -Mlarge_arrays -Mcudalib=cusolver,cublas -llapack -lblas -fortranlibs -cudaforlibs -o …/bin/delfrac (70 *.o files)
The program was run using the RTX 4090.
The program ran, except it was extremely slow! The GPUShark tool showed the 4090 core continuously at 100% load while using only 18% TDP.
When I say slow, I mean each step took about 40 s, while my previous GPU (a GTX 1080 Ti) was taking only about 5 s per step! The answers were, however, correct!
Can you shed some light on this ?
Best regards,
Jo
Hi Jo,
Given the large performance difference, my first guess would be that the application is running multicore on the CPU and not on the GPU. The GPU utilization could be due to the use of cuSolver.
If you run the command "nvaccelinfo", is the runtime able to find the GPU?
Can you run a profile, i.e. "nsys profile <my_app>", and then review the report via "nsys stats report1.nsys-rep", to ensure all the GPU kernels are being offloaded?
If it is running on the GPU, where is the application spending most of its time? In a particular kernel? Data movement? Poor performance across all kernels?
-Mat
Hi Mat,
Looks like bingo for the first assumption!
johannesp@DESKTOP-45G6M31:~$ nvaccelinfo
No accelerators found.
Try nvaccelinfo -v for more information
johannesp@DESKTOP-45G6M31:~$ nvaccelinfo -v
libcuda.so not found
No accelerators found.
Check that you have installed the CUDA driver properly
Check that your LD_LIBRARY_PATH environment variable points to the CUDA runtime installation directory
johannesp@DESKTOP-45G6M31:~$
Now for the solution? What is the proper sequence to install the CUDA driver? We thought we had set LD_LIBRARY_PATH properly, but apparently not.
Jo
Let's assume the driver is installed properly per the doc I linked to above.
Where is "libcuda.so" installed? Is it in "/usr/lib/wsl/lib" or some place else?
Next verify that LD_LIBRARY_PATH includes this directory.
Is the application getting invoked via a shell or other script? If so, is the environment getting inherited?
Hi Mat,
Re-set LD_LIBRARY_PATH using "export LD_LIBRARY_PATH=/usr/lib/wsl/lib"
and ran nvaccelinfo -v:
johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$ nvaccelinfo -v
CUDA Driver Version: 12010
Device Number: 0
Device Name: NVIDIA GeForce RTX 4090
Device Revision Number: 8.9
Global Memory Size: 25756696576
Number of Multiprocessors: 128
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 2550 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 10501 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 75497472 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc86
johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$
GPU present, no error regarding libcuda.so.
The LD_LIBRARY_PATH setting gets lost on closing the Linux session.
The program is run from the Ubuntu command window: johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$ ./delfrac
Started a run: same answers as before: deadly slow, but GPU at full blast.
Try adding the LD_LIBRARY_PATH setting to your .bashrc file so it gets set implicitly.
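One way to do that, as a sketch (assumes the standard per-user ~/.bashrc; the grep guard is just to avoid appending the line twice):

```shell
BASHRC="$HOME/.bashrc"
# Single quotes keep $LD_LIBRARY_PATH literal in the file, so it expands
# at login time rather than now.
LINE='export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH'
# Append the export only if it is not already present.
grep -qxF "$LINE" "$BASHRC" 2>/dev/null || echo "$LINE" >> "$BASHRC"
grep -F '/usr/lib/wsl/lib' "$BASHRC"
```

After this, every new interactive shell picks the setting up automatically.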
Are you able to run a profile with Nsight Systems (nsys)?
Hi Mat,
I tried adding "export LD_LIBRARY_PATH=/usr/lib/wsl/lib" to bash.bashrc.
It made no difference to performance.
While looking for bash.bashrc I noticed there was more than one version. I added the path to the version which had more "export" lines.
I ran nsys profile.
That produced two files, report1.nsys-rep and report1.sqlite. Both are attached here.
Jo
report1.nsys-rep (312 KB)
report1.sqlite (804 KB)
The profile confirms that the code is not running on the GPU.
The typical reason for this is the runtime not being able to find the CUDA driver (libcuda.so), but it may be failing for other reasons.
Let's try to narrow this down by compiling and running one of the example OpenACC codes that ship with the compilers. Here's what I did on my laptop's WSL2 Ubuntu install:
mcolgrove@NV-JZG9LG3:~/tmp$ cp /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/examples/OpenACC/samples/acc_f1/acc_f1.f90 .
mcolgrove@NV-JZG9LG3:~/tmp$ nvfortran -acc -Minfo=accel acc_f1.f90
main:
28, Generating implicit copyin(a(1:n)) [if not already present]
Generating implicit copyout(r(1:n)) [if not already present]
29, Loop is parallelizable
Generating NVIDIA GPU code
29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
mcolgrove@NV-JZG9LG3:~/tmp$ export NV_ACC_NOTIFY=1
mcolgrove@NV-JZG9LG3:~/tmp$ a.out
launch CUDA kernel file=/home/mcolgrove/tmp/acc_f1.f90 function=main line=29 device=0 threadid=1 num_gangs=782 num_workers=1 vector_length=128 grid=782 block=128
100000 iterations completed
Test PASSED
"NV_ACC_NOTIFY=1" has the OpenACC runtime print a line each time a kernel is launched, so we can tell if the code actually ran on the GPU.
If this works for you as well, then we can look at what's different about your application. If it doesn't work, then the issue is more likely system related, such as the CUDA driver.
-Mat
It did just the same:
johannesp@DESKTOP-45G6M31:~$ cp /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/examples/OpenACC/samples/acc_f1/acc_f1.f90 .
johannesp@DESKTOP-45G6M31:~$ nvfortran -acc -Minfo=accel acc_f1.f90
main:
28, Generating implicit copyin(a(1:n)) [if not already present]
Generating implicit copyout(r(1:n)) [if not already present]
29, Loop is parallelizable
Generating NVIDIA GPU code
29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
johannesp@DESKTOP-45G6M31:~$ export NV_ACC_NOTIFY=1
johannesp@DESKTOP-45G6M31:~$ ./a.out
launch CUDA kernel file=/home/johannesp/acc_f1.f90 function=main line=29 device=0 threadid=1 num_gangs=782 num_workers=1 vector_length=128 grid=782 block=128
100000 iterations completed
Test PASSED
johannesp@DESKTOP-45G6M31:~$
Jo
Ok, so it's something to do with your program.
Which model(s) are you using: OpenACC, CUDA Fortran, Fortran DO CONCURRENT?
I seem to recall it being OpenACC with calls to cuSolver.
The flags you're using enable all three, so we should turn off those you're not using.
Also, let's remove "multicore" and explicitly set "cc86" rather than "ccall", or remove the "-gpu=ccall" flag altogether and have the compiler auto-detect the GPU.
i.e. something like:
nvfortran -c -gopt -acc=gpu -cuda -cudalib=cusolver
(note that we're dropping the "-M" from "-cuda" and "-cudalib")
Finally, double check that you're not setting "ACC_DEVICE_TYPE=HOST" in your environment. Assuming you're using OpenACC, keep "NV_ACC_NOTIFY=1" set so you can check if the kernels are being launched.
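A quick environment sanity check along these lines (the variable names are the standard OpenACC/NVHPC runtime ones; nothing here is specific to any one machine):

```shell
# ACC_DEVICE_TYPE should be unset (or "nvidia"); "host" forces CPU execution.
echo "ACC_DEVICE_TYPE=${ACC_DEVICE_TYPE:-<unset>}"
# Have the runtime report each kernel launch so GPU execution is visible.
export NV_ACC_NOTIFY=1
echo "NV_ACC_NOTIFY=$NV_ACC_NOTIFY"
```

If ACC_DEVICE_TYPE is set to "host", `unset ACC_DEVICE_TYPE` before rerunning the application.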
-Mat
I modified the flags as you suggested: removed "multicore", left the compiler to auto-detect the GPU, and dropped the "-M" from "-cuda" and "-cudalib".
I kept "NV_ACC_NOTIFY=1" and ran the example again. The output is attached. The results are for 1 frequency step out of 100 in total.
The total time for 1 frequency step was about 40 s, while the computation on a laptop with a GTX 1080 (not 1080 Ti) took about 2 s for one frequency for exactly the same case. That code was compiled under Windows 10 using PGI 19.10 from the same Fortran sources as now being used for the Linux case on the RTX 4090.
The output for the Linux case shows calls to the following routines which use the GPU:
- Ult_inf_channel_b.for
- Multrlid.for
- FALTMICH.FOR
- Velocit_C_min7.FOR
As it stands, most time is spent in Multrlid.for between lines 58 and 118.
The calls to cuSolver are in that interval. See the attached Multrlid.for (with some superfluous comments).
Ult_inf_channel_b.for is a routine you pretty much optimized for me a few years ago.
Also attached is a screenshot showing the output and the GPU data in GPUShark.
Jo
Multrlid.for (4.06 KB)
Test_case.txt (6.46 KB)
Ok, so the good news is that it is actually running, but I have no idea why it's slower.
The kernels in Multrlid.for are very small and run with few threads (68 grids x 128 threads per grid), so it doesn't quite make sense that these would be slow.
I'm wondering if the slow-down is due to data movement or possibly the cuSolver calls.
Can you try getting another Nsight Systems profile? Now that we can confirm it's running on the GPU, hopefully nsys can capture the kernel times.
Try running "nsys profile -o run1 <my_app>", and then "nsys stats run1.nsys-rep > run1.txt". Then post run1.txt.
If it still has issues generating a profile, it could be that it can't find libcupti.so, which is the device profiler library. Try finding it on your system, and set LD_LIBRARY_PATH to include its directory.
As a fallback, you can try setting "NV_ACC_TIME=1" to get a basic profile from our runtime. If it only prints the host times (elapsed) and not the device time, then it can't find libcupti.so.
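A sketch of both profiling paths, assuming the binary is ./delfrac as in the build script above (the nsys and application lines are commented out since they only make sense on the target system):

```shell
# Full profile with Nsight Systems (uncomment to run on your system):
# nsys profile -o run1 ./delfrac
# nsys stats run1.nsys-rep > run1.txt

# Fallback: the OpenACC runtime's built-in profiler prints a per-region
# timing summary when the application exits.
export NV_ACC_TIME=1
echo "NV_ACC_TIME=$NV_ACC_TIME"
# ./delfrac
```

Remember that NV_ACC_TIME must be exported (or set on the command line as "NV_ACC_TIME=1 ./delfrac"), not just typed on a line by itself.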
Ran "nsys profile -o run1 program", followed by "nsys stats run1.nsys-rep > run1.txt".
run1.txt is attached.
Jo
run1.txt (2.4 KB)
Hmm, still nothing in the profile.
Support for WSL2 was added in Nsight Systems 2022.4, and you should have an up-to-date version installed. My only guess is that libcupti.so isn't being found.
Did you add the directory where it's located to your LD_LIBRARY_PATH? Something like: "/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/"
Did you try running the OpenACC profiler by setting NV_ACC_TIME=1 in your environment? Even if it can't find libcupti, it can still profile from the host side.
I found libcupti.so more or less by chance and then added the location as follows:
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64
To check the location I then did the following:
johannesp@DESKTOP-45G6M31:~$ ls -l /usr/local/cuda-12.1/lib64/
and then got a list of files at that location, among them:
lrwxrwxrwx 1 root root 14 Apr 4 03:05 libcupti.so -> libcupti.so.12
lrwxrwxrwx 1 root root 20 Apr 4 03:05 libcupti.so.12 -> libcupti.so.2023.1.1
-rw-r--r-- 1 root root 7419504 Apr 4 03:05 libcupti.so.2023.1.1
-rw-r--r-- 1 root root 18490376 Apr 4 03:05 libcupti_static.a
I guess the path has been added to LD_LIBRARY_PATH, or do I have to do that in another way?
Jo
I also did a run preceded by
NV_ACC_TIME=1
but got no timing data at all
Jo
I used your path to see if it would work for me; it did!
johannesp@DESKTOP-45G6M31:~$ ls -l /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/
Resulted in a list of files including a series of "libcupti.so" versions.
I added the path using:
johannesp@DESKTOP-45G6M31:~$ export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/
If that looks ok, I'll do some runs to get timing data?
Jo
Did the run in accordance with your last mail (May 19).
The LD_LIBRARY_PATH setting etc. are included in the attached output. My other recent mails are not so relevant (I think).
Jo
Run_22_5_23.txt (5.63 KB)