Compiling Fortran code to run on RTX 4090

Hi, I have been developing Fortran codes to solve hydrodynamic problems (potential flow around floating objects) for a number of years. For the last few years I have been using the Windows version of the PGI 19.10 compiler in order to run the codes on CPU + GPU systems. The codes are mostly run on a GTX 1080 Ti. I have been very pleased with the performance of the codes on GPU and have received a lot of help from PGI and NVIDIA in getting the codes compiled.
Recently, I updated my desktop and had an RTX 4090 fitted. It seems that I now have a problem because I am not able to run my existing codes on the RTX 4090. On checking for a suitable GPU at start-up, the programs immediately switch to the CPU and will not use the 4090.
I have not been able to find any information on why this is so or what the solution is. I know that NVIDIA has so far not updated the Windows version of the compilers. I have invested a huge amount of time in developing these codes and fear that I have made the investment in the new GPU for nothing. Which path is now open to me?
jo_rotter

Hi Jo_rotter,

You might try compiling with “-tp=nollvm” so a PTX version of the device code is generated which in turn gets JIT compiled on the 4090. I haven’t tried this myself so I can’t be sure, but worth a try.

The second option is to move to Linux and install the latest version of the NVHPC compilers. While we don’t officially support it, several users have had success using WSL2, which lets you run Linux directly on your Windows system.

See the following guide on setting up WSL2 with CUDA support: NVIDIA GPU Accelerated Computing on WSL 2

Note, when running your code under WSL2, be sure to set LD_LIBRARY_PATH in your environment to include “/usr/lib/wsl/lib” or wherever libcuda.so was installed. Otherwise the runtime can’t find the CUDA driver.
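
A minimal sketch of that LD_LIBRARY_PATH setup, for illustration only. The helper name prepend_lib_dir is hypothetical (it is not part of the NVHPC SDK), and the sketch assumes libcuda.so sits in /usr/lib/wsl/lib as described above. It prepends the directory rather than replacing the variable, so any existing entries are kept.

```shell
#!/bin/bash
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH
# unless it is already listed there.
prepend_lib_dir() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
    esac
}

# Assumed location of the WSL2 CUDA driver library; adjust if libcuda.so lives elsewhere.
WSL_LIB_DIR=/usr/lib/wsl/lib
if [ -e "${WSL_LIB_DIR}/libcuda.so" ] || [ -e "${WSL_LIB_DIR}/libcuda.so.1" ]; then
    prepend_lib_dir "${WSL_LIB_DIR}"
else
    echo "libcuda.so not found in ${WSL_LIB_DIR}; locate it before setting the path" >&2
fi
```

The settings only last for the current shell session unless the same lines are placed in a startup file.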

-Mat

Hi Mat,
We followed up your suggestion to use WSL2 on Windows 11. CUDA 12.1 was already installed, and Ubuntu was installed through WSL2. After that, the latest compiler (23.3) was installed. Using the guide “NVIDIA GPU Accelerated Computing on WSL 2”, the NVIDIA Linux GPU driver was downloaded and installed following the steps on the CUDA download page for WSL-Ubuntu. Finally, LD_LIBRARY_PATH was set to include “/usr/lib/wsl/lib”. We checked that libcuda.so was installed at that location.
We took all the files of the code and successfully compiled and linked the program.
We used the following “build-delfrac.sh” file (leaving out the 70-odd *.for files which make up the program):
#!/bin/bash
O_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/math_libs/12.0/targets/x86_64-linux/lib/stubs
LAPACK_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/lib

cd obj

nvfortran -c -cuda -Minfo=ftn -cudalib=cusolver -fortranlibs -acc=gpu -gpu=ccnative -target=gpu -cudaforlibs \
/src/(70 files)

nvfortran -c -g -Mbackslash -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Minform=warn -Mvect=levels:8 -Mlarge_arrays -Mcudalib=cusolver \
/src/(70 files)

echo "Compiling
"
nvfortran -c -Mextend -traceback -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Mcudalib=cusolver \
/src/(70 files)

nvfortran -c -gopt -C -traceback -Mbounds -Mchkptr -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mcuda -Mcudalib=cusolver \
/src/(70 files)

echo "Linking
"

nvfortran -acc -Mcuda -acc=gpu,multicore -stdpar=gpu,multicore -gpu=ccall -Mvect=levels:8 -Mlarge_arrays -Mcudalib=cusolver,cublas -llapack -lblas -fortranlibs -cudaforlibs -o \
/bin/delfrac (70 *.o files)

.@@@

The program was run using the rtx 4090.

The program ran, but it was extremely slow! The GPUShark tool showed the 4090 core usage continuously at 100% load while using only 18% TDP.
By slow I mean each step took about 40 s, while my previous GPU (GTX 1080 Ti) took only about 5 s per step! The answers were, however, correct!

Can you shed some light on this?

Best regards,

Jo

Hi Jo,

Given the large performance difference, my first guess would be that the application is running on the multicore CPU and not on the GPU. The GPU utilization could be due to the use of cuSolver.

If you run the command “nvaccelinfo”, is the runtime able to find the GPU?

Can you run a profile, i.e. “nsys profile <my_app>”, and then review the report via “nsys stats report1.nsys-rep”, to ensure all the GPU kernels are being offloaded?

If it is running on the GPU, where is the application spending most of its time? In a particular kernel? Data movement? Poor performance across all kernels?
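
The two profiling commands can be wrapped in a small script, sketched below. This is an illustration under assumptions: profile_app is a hypothetical helper name, "run1" in the usage line is an arbitrary report name, and the sketch assumes nsys writes the report as NAME.nsys-rep in the current directory.

```shell
#!/bin/bash
# Hypothetical wrapper: profile a command with Nsight Systems, then dump
# the summary tables to a text file that can be posted to the forum.
profile_app() {
    local report="$1"; shift
    if ! command -v nsys >/dev/null 2>&1; then
        echo "nsys not found on PATH" >&2
        return 127
    fi
    # Collect the profile; give up if the profiled run itself fails.
    nsys profile -o "${report}" "$@" || return $?
    # Summarize the report into plain text.
    nsys stats "${report}.nsys-rep" > "${report}.txt"
}

# Usage (the program name is a placeholder for your own executable):
#   profile_app run1 ./my_app
```

If the resulting text file contains no CUDA kernel or memory-copy tables, either no kernels were launched or the profiler could not attach to the device.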

-Mat

Hi Mat,

Looks like bingo for the first assumption!

johannesp@DESKTOP-45G6M31:~$ nvaccelinfo
No accelerators found.
Try nvaccelinfo -v for more information
johannesp@DESKTOP-45G6M31:~$ nvaccelinfo -v
libcuda.so not found
No accelerators found.
Check that you have installed the CUDA driver properly
Check that your LD_LIBRARY_PATH environment variable points to the CUDA runtime installation directory
johannesp@DESKTOP-45G6M31:~$

Now for the solution: what is the correct sequence for installing the CUDA driver properly? We thought we had set LD_LIBRARY_PATH properly, but apparently not.

Jo

Let’s assume the driver is installed properly per the doc I linked to above.

Where is “libcuda.so” installed? Is it in “/usr/lib/wsl/lib” or some place else?

Next verify that LD_LIBRARY_PATH includes this directory.

Is the application getting invoked via a shell or other script? If so, is the environment getting inherited?
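
These three checks can be scripted roughly as follows. It is only a sketch: path_has_dir is a hypothetical helper name, and apart from /usr/lib/wsl/lib the candidate directories are guesses at where a driver library might be found, not locations confirmed in this thread.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given directory is listed in LD_LIBRARY_PATH.
path_has_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# 1. Where is libcuda.so? Only a few candidate directories are checked
#    (all but the first are assumptions).
for dir in /usr/lib/wsl/lib /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64; do
    if ls "${dir}"/libcuda.so* >/dev/null 2>&1; then
        # 2. Is that directory on LD_LIBRARY_PATH?
        if path_has_dir "${dir}"; then
            echo "${dir}: has libcuda.so and is in LD_LIBRARY_PATH"
        else
            echo "${dir}: has libcuda.so but is NOT in LD_LIBRARY_PATH"
        fi
    fi
done

# 3. Is the setting inherited? A child process only sees exported variables.
sh -c 'echo "child process sees LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"'
```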

Hi Mat,

I set LD_LIBRARY_PATH again using "export LD_LIBRARY_PATH=/usr/lib/wsl/lib"

and ran nvaccelinfo -v:

johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$ nvaccelinfo -v

CUDA Driver Version: 12010

Device Number: 0
Device Name: NVIDIA GeForce RTX 4090
Device Revision Number: 8.9
Global Memory Size: 25756696576
Number of Multiprocessors: 128
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 2550 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 10501 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 75497472 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc86
johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$

GPU present, no error regarding libcuda.so.

The LD_LIBRARY_PATH setting gets lost on closing the Linux session.

The program is run from the Ubuntu command window: johannesp@DESKTOP-45G6M31:~/Test_Dirk_2/bin$ ./delfrac

Started a run: same answers as before. Deadly slow, but GPU at full blast.

Try adding the LD_LIBRARY_PATH to your .bashrc file so it gets set implicitly.
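
One way to do that without adding the same line twice is sketched here. add_line_once is a hypothetical helper name, and the usage line assumes the per-user startup file is ~/.bashrc and that /usr/lib/wsl/lib is the right directory.

```shell
#!/bin/bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
add_line_once() {
    local line="$1" file="$2"
    touch "${file}"
    grep -qxF -- "${line}" "${file}" || printf '%s\n' "${line}" >> "${file}"
}

# Usage (left commented out so nothing is modified by accident):
#   add_line_once 'export LD_LIBRARY_PATH=/usr/lib/wsl/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}' "$HOME/.bashrc"
# Then open a new terminal, or run: source ~/.bashrc
```

The single quotes in the usage line keep the variable unexpanded, so it is evaluated each time a shell starts rather than when the line is written.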

Are you able to run a profile with Nsight-Systems (nsys)?

Hi Mat,

I tried adding ‘export LD_LIBRARY_PATH=/usr/lib/wsl/lib’ to bash.bashrc.
It made no difference to performance.
While looking for bash.bashrc I noticed there was more than one version. I added the path to
the version which had more ‘export’ lines.

I ran nsys profile.
That produced two files, report1.nsys-rep and report1.sqlite. Both are attached here.

Jo

report1.nsys-rep (312 KB)

report1.sqlite (804 KB)

The profile confirms that the code is not running on the GPU.

The typical reason for this is that the runtime is not able to find the CUDA driver (libcuda.so), but it may be failing for other reasons.

Let’s try to narrow this down by compiling and running one of the example OpenACC codes that ship with the compilers. Here’s what I did on my laptop’s WSL2 Ubuntu install:

mcolgrove@NV-JZG9LG3:~/tmp$ cp /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/examples/OpenACC/samples/acc_f1/acc_f1.f90 .
mcolgrove@NV-JZG9LG3:~/tmp$ nvfortran -acc -Minfo=accel acc_f1.f90
main:
     28, Generating implicit copyin(a(1:n)) [if not already present]
         Generating implicit copyout(r(1:n)) [if not already present]
     29, Loop is parallelizable
         Generating NVIDIA GPU code
         29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
mcolgrove@NV-JZG9LG3:~/tmp$ export NV_ACC_NOTIFY=1
mcolgrove@NV-JZG9LG3:~/tmp$ a.out
launch CUDA kernel  file=/home/mcolgrove/tmp/acc_f1.f90 function=main line=29 device=0 threadid=1 num_gangs=782 num_workers=1 vector_length=128 grid=782 block=128
       100000 iterations completed
 Test PASSED

“NV_ACC_NOTIFY=1” has the OpenACC runtime print a line each time a kernel is launched so we can tell if the code actually ran on the GPU.

If this works for you as well, then we can look at what’s different about your application. If this doesn’t work, then the issue is more likely system related, such as the CUDA driver.
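
For a larger program, the notify output can be long, so counting the launch lines may be easier than reading them. The sketch below does that. count_kernel_launches is a hypothetical helper name, and it assumes each launch is reported on its own line beginning with "launch CUDA kernel", as in the output shown above.

```shell
#!/bin/bash
# Hypothetical helper: run a command with NV_ACC_NOTIFY=1 and print how many
# "launch CUDA kernel" lines it produced. Both stdout and stderr are scanned,
# since the stream used by the runtime is not assumed here.
count_kernel_launches() {
    NV_ACC_NOTIFY=1 "$@" 2>&1 | grep -c '^launch CUDA kernel'
}

# Usage:
#   count_kernel_launches ./a.out
# A count of 0 means no kernels were launched on the GPU.
```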

-Mat

It did just the same:

johannesp@DESKTOP-45G6M31:~$ cp /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/examples/OpenACC/samples/acc_f1/acc_f1.f90 .
johannesp@DESKTOP-45G6M31:~$ nvfortran -acc -Minfo=accel acc_f1.f90
main:
28, Generating implicit copyin(a(1:n)) [if not already present]
Generating implicit copyout(r(1:n)) [if not already present]
29, Loop is parallelizable
Generating NVIDIA GPU code
29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
johannesp@DESKTOP-45G6M31:~$ export NV_ACC_NOTIFY=1
johannesp@DESKTOP-45G6M31:~$ ./a.out
launch CUDA kernel file=/home/johannesp/acc_f1.f90 function=main line=29 device=0 threadid=1 num_gangs=782 num_workers=1 vector_length=128 grid=782 block=128
100000 iterations completed
Test PASSED
johannesp@DESKTOP-45G6M31:~$

Jo

Ok, so it’s something to do with your program.

Which model(s) are you using: OpenACC, CUDA Fortran, Fortran DO CONCURRENT?

I seem to recall it being OpenACC with calls to cuSolver.

The flags you’re using enable all three, so we should turn off those you’re not using.

Also, let’s remove “multicore” and explicitly set “cc86” rather than “ccall”, or remove “-gpu=ccall” flag altogether and have the compiler auto-detect the gpu.

i.e. something like:

nvfortran -c -gopt -acc=gpu -cuda -cudalib=cusolver

(note that we’re dropping the “-M” from “-cuda” and “-cudalib”)

Finally, double check that you’re not setting “ACC_DEVICE_TYPE=HOST” in your environment. Assuming you’re using OpenACC, keep “NV_ACC_NOTIFY=1” set so you can check if the kernels are being launched.
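
The environment check can be done with a few lines of shell, sketched below. check_acc_env is a hypothetical helper name, and only the HOST value mentioned above is tested (case-insensitively, as an assumption about how the runtime reads it).

```shell
#!/bin/bash
# Hypothetical helper: warn if ACC_DEVICE_TYPE is set to HOST, which would
# keep OpenACC regions on the CPU. Returns 1 in that case, 0 otherwise.
check_acc_env() {
    case "${ACC_DEVICE_TYPE:-}" in
        [Hh][Oo][Ss][Tt])
            echo "ACC_DEVICE_TYPE=${ACC_DEVICE_TYPE} is set; unset it to allow GPU execution"
            return 1 ;;
        *)
            echo "ACC_DEVICE_TYPE is not set to HOST"
            return 0 ;;
    esac
}

# Usage, before launching the program:
#   check_acc_env && NV_ACC_NOTIFY=1 ./delfrac
```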

-Mat

I modified the flags as you suggested: removed “multicore”, left the compiler to auto-detect the GPU, and dropped the “M” from “-Mcuda” and “-Mcudalib”.
I kept “NV_ACC_NOTIFY=1” and ran the example again. The output is attached. The results are for 1 frequency step out of 100 in total.
Total time for 1 frequency step was about 40 s, while computations on a laptop with a GTX 1080 (not 1080 Ti) take about 2 s per frequency for exactly the
same case. That code was compiled under Windows 10 using PGI 19.10 and the same Fortran source codes as now being used for the Linux case on the RTX 4090.
The output for the Linux case shows calls to the following routines, which use the GPU:

  • Ult_inf_channel_b.for
  • Multrlid.for
  • FALTMICH.FOR
  • Velocit_C_min7.FOR

As it stands, most time is spent in Multrlid.for between lines 58 and 118.
The calls to cuSolver are in that interval. See the attached Multrlid.for (with some superfluous comments).
Ult_inf_channel_b.for is a routine you pretty much optimized for me a few years ago.
Also attached is a screenshot showing the output and the GPU data in GPUShark.

Jo

Multrlid.for (4.06 KB)

Test_case.txt (6.46 KB)

Ok, so the good news is that it is actually running on the GPU, but I have no idea why it’s slower.

The kernels in Multrlid.for are very small and run with few threads (68 blocks x 128 threads per block), so it doesn’t quite make sense that these would be slow.

I’m wondering if the slow-down is due to data movement or possibly the cuSolver calls?

Can you try getting another Nsight-Systems profile? Now that we have confirmed that it’s running on the GPU, hopefully nsys can capture the kernel times.

Try running “nsys profile -o run1 <my_app>”, and then “nsys stats run1.nsys-rep > run1.txt”. Then post run1.txt.

If it still has issues generating a profile, it could be that it can’t find libcupti.so, which is the device profiler library. Try finding it on your system, and set LD_LIBRARY_PATH to include this directory.

As a fallback, you can try setting “NV_ACC_TIME=1” to get a basic profile from our runtime. If it only prints the host times (elapsed) and not the device time, then it can’t find libcupti.so.

I ran "nsys profile -o run1 program" followed by "nsys stats run1.nsys-rep > run1.txt".

Run1.txt is attached

Jo

run1.txt (2.4 KB)

Hmm, still nothing in the profile.

Support for WSL2 was added to Nsight-Systems 2022.4, and you should have an up to date version installed. My only guess is that libcupti.so isn’t being found.

Did you add the directory where it’s located to your LD_LIBRARY_PATH? Something like: “/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/”

Did you try running the OpenACC profiler by setting NV_ACC_TIME=1 in your environment? If it can’t find libcupti, it can still profile from the host side.
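
A sketch of how the library could be located and added, under stated assumptions: find_lib_dirs is a hypothetical helper name, and the search roots in the usage lines are guesses based on the paths mentioned in this thread. Note that “export LD_LIBRARY_PATH=<dir>” on its own replaces the previous value, so the usage lines append instead, which keeps the libcuda.so directory on the path.

```shell
#!/bin/bash
# Hypothetical helper: print each directory under the given search roots that
# contains a file or symlink whose name starts with the given library name.
find_lib_dirs() {
    local name="$1"; shift
    find "$@" -name "${name}*" \( -type f -o -type l \) 2>/dev/null \
        | sed 's|/[^/]*$||' | sort -u
}

# Usage (commented out; the search roots are assumptions):
#   for d in $(find_lib_dirs libcupti.so /opt/nvidia/hpc_sdk /usr/local); do
#       export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${d}"
#   done
#   export NV_ACC_TIME=1   # exported, so the program started next inherits it
#   ./delfrac
```

A plain “NV_ACC_TIME=1” on its own line sets a shell variable but does not export it, so a program started afterwards would not see it.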

I found libcupti.so more or less by chance and then added the location as follows:

export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64

To check the location I then did the following:

johannesp@DESKTOP-45G6M31:~$ ls -l /usr/local/cuda-12.1/lib64/

and got a list of files at that location, among them:
lrwxrwxrwx 1 root root 14 Apr 4 03:05 libcupti.so -> libcupti.so.12
lrwxrwxrwx 1 root root 20 Apr 4 03:05 libcupti.so.12 -> libcupti.so.2023.1.1
-rw-r--r-- 1 root root 7419504 Apr 4 03:05 libcupti.so.2023.1.1
-rw-r--r-- 1 root root 18490376 Apr 4 03:05 libcupti_static.a

I guess the path has now been added to LD_LIBRARY_PATH, or do I have to do that in another way?

Jo

I also did a run preceded by
NV_ACC_TIME=1

but got no timing data at all

Jo

I used your path to see if it would work for me; it did!
johannesp@DESKTOP-45G6M31:~$ ls -l /opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/

Resulted in a list of files including a series of ‘libcupti.so’ versions

I added the address using :
johannesp@DESKTOP-45G6M31:~$ export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/profilers/Nsight_Systems/target-linux-x64/

If that looks OK, I’ll do some runs to get timing data.

Jo

Did the run in accordance with your last mail (May 19).

The LD_LIBRARY_PATH setting etc. are included in the attached output. Other recent mails of mine are not so relevant (I think).

Jo

Run_22_5_23.txt (5.63 KB)