Nvfortran 23.9 problem

HPC SDK 23.9 just released.

I compiled my program with nvfortran 23.9 in WSL2 (Ubuntu 22.04.2) and ran into several problems compared with the older compilers I used previously, nvfortran 21.3 and nvfortran 22.1:

  1. Compilation is very slow, much slower than with the older compilers.
  2. Input and output of the executable built with nvfortran 23.9 are much slower than with executables built with the older compilers.
  3. The executable built with nvfortran 23.9 computes noticeably slower than executables built with the older compilers. At the same time, the GPU-Z utility shows a noticeably lower GPU load.

Has anyone encountered these problems and is there a solution?

I am confused.
This problem seems to be related to WSL.

I installed nvfortran 21.3 and got the same results:

  1. Slow compilation
  2. Slow input/output
  3. Lower GPU utilization and slower computation

But I don’t have this problem on another Windows machine under WSL.
Other machine: Windows 11 22H2, WSL2 Ubuntu 20.04 LTS, nvfortran 21.3-0
This machine: Windows 11 Pro for Workstations, 22H2, WSL2 Ubuntu 20.04.6 LTS, nvfortran 21.3-0

What could be the cause?

The compilation problem turned out to be caused by the working folder being on a different hard drive. As soon as I moved it to the system drive, compilation and I/O became fast.

But computation is still slower than on the other machine, even though the other machine has a less powerful GPU.

WSL version is the same but this machine has CUDA Version: 12.2, Driver Version: 536.67
Other machine - CUDA Version: 11.8, Driver Version: 522.25
Could the problem be in CUDA or driver version?

Hi Maxim,

Possible but doubtful, though I don’t have any other ideas on what it could be.

Are you able to profile the code with Nsight-systems on both platforms to see where the difference in performance occurs?

Is it the kernel runtime, data movement between device and host, or the CPU time?

-Mat

Hi, Mat.

Thanks for the reply.
Profiling will take time.

I can make some assumptions without profiling though.
CPU time is negligible. Almost all the computations, as far as I remember (the program was written a long time ago), are performed by GPU.
Also I think the data exchange between CPU and GPU should have little impact. The main arrays are loaded into the GPU at the very beginning and copied to the host only for output and post-processing.

My guess is that efficiency is lost in the interaction between WSL and the GPU.
There is an old executable built a long time ago with PGI Fortran 19.10 for Windows.
On this machine, that executable computes faster than the executable built with NVFortran 21.3 in WSL.
On the other machine, both executables compute at approximately the same speed; the one built by NVFortran 21.3 in WSL is even slightly faster.

I forgot to mention something that may make a difference:
The graphics card on that machine is a GeForce RTX 2060.
The graphics card on this machine is a Quadro RTX 5000.

Ok, I thought you were using WSL in both cases as opposed to WSL versus native Windows.

I only have limited experience with WSL. As I understand it, the Linux install runs in a virtual machine, but a lightweight one, so it adds little overhead. However, my guess is that the CUDA driver needs to go through some type of emulation layer, which may add overhead. I’m not an expert here though, so I don’t know the details.

Best I can offer is to point you to the NVIDIA docs on using WSL2 to see if there’s anything in there that might help: NVIDIA GPU Accelerated Computing on WSL 2 — CUDA on WSL 12.3 documentation

I used WSL with NVFortran 21.3 and Windows with PGI Fortran 19.10 on both machines.

The computational times were:
GeForce RTX 2060 (Win11, PGI Fortran 19.10): 16.898 s
GeForce RTX 2060 (WSL, NVFortran 21.3): 14.284 s

Quadro RTX 5000 (Win11, PGI Fortran 19.10): 12.834 s
Quadro RTX 5000 (WSL, NVFortran 21.3): 23.626 s

It does seem more likely to be an issue with WSL on this particular system, which unfortunately I can’t help you with. But if you can share your code, I can run it on Linux to see if there are any slowdowns between 21.3 and 19.10.

If I do find a slowdown, then I’ll be able to help; otherwise, you’ll need to review the WSL docs I pointed you to earlier.

Hi Mat.

I also think it’s a problem in the interaction between WSL and Windows on this particular machine.
But this problem should not be unique; it must be related to WSL settings, driver compatibility, CUDA, or the like.
Unfortunately, the WSL document you pointed me to earlier did not provide any useful information for solving the problem.

Additional information

I checked the CUDA sample nbody: the one built in Windows is a bit slower than the one built in WSL.

Win11 (nvcc 11.6):
1000192 bodies, total time for 10 iterations: 27140.543 ms
= 368.594 billion interactions per second
= 7371.879 single-precision GFLOP/s at 20 flops per interaction

WSL (nvcc 12.3):
1000192 bodies, total time for 10 iterations: 23870.596 ms
= 419.086 billion interactions per second
= 8381.727 single-precision GFLOP/s at 20 flops per interaction

But my OpenACC program in WSL is noticeably slower than the Win PGI Fortran 19.10 version.

Win11 (PGI Fortran 19.10):
Time of main loop: 13.588 seconds.

WSL (nvfortran 23.9-0):
Time of main loop: 23.853 seconds.

So it’s probably not a WSL settings issue, but a driver and CUDA compatibility issue.

Win CUDA is 11.6.
But nvidia-smi on Windows shows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

WSL CUDA is 12.3.
nvidia-smi in WSL shows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

For my compilation I used nvfortran from HPC SDK 23.9 with multiple CUDA installations: 12.2, 11.8, 11.0.

Again, if you can share the code, then I can run it on Linux to see if the slowdown is reproducible.

I have a Linux machine and there is no slowdown on it.
This is a WSL-only issue.

I installed HPC SDK version 23.11 today and am also noticing slow compilation and slow standard output compared to g++ and clang++.

g++

$ time g++ -o llil4hmap -std=c++20 -fopenmp -Wall -O3 llil4hmap.cc -I./parallel-hashmap

real  0m3.574s
user  0m3.441s
sys   0m0.124s

$ ./llil4hmap /data1/input/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.759 secs
hmap to vector         1.043 secs
vector stable sort     1.715 secs
write stdout           0.868 secs
total time             7.386 secs
    count lines     323398400
    count unique    200483043
701308064 1804347429

clang++

$ time clang++ -o llil4hmap -std=c++20 -fopenmp -Wall -O3 llil4hmap.cc -I./parallel-hashmap

real  0m2.963s
user  0m2.887s
sys  0m0.068s

$ ./llil4hmap /data1/input/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.759 secs
hmap to vector         0.710 secs
vector stable sort     1.125 secs
write stdout           0.702 secs
total time             6.298 secs
    count lines     323398400
    count unique    200483043
701308064 1804347429

nvc++

$ time nvc++ -o llil4hmap -std=c++20 -fopenmp -Wall -O3 llil4hmap.cc -I./parallel-hashmap

real  0m21.274s
user  0m20.828s
sys   0m0.413s

$ ./llil4hmap /data1/input/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.804 secs
hmap to vector         0.697 secs
vector stable sort     1.104 secs
write stdout           5.071 secs
total time            10.678 secs
    count lines     323398400
    count unique    200483043
701308064 1804347429

Hi marioeroy,

Are you able to provide the data set so we can investigate?

It appears the main difference is in “out_properties”, so it might be an issue with “ordered”, but I’m not sure.

Is this issue WSL2 specific, or more general?

-Mat

Disclaimer: I’m new to C++ and was simply trying things out at the time.

Input data for the “Rosetta Code: Long List is Long” challenge is generated by the Perl script gen-llil.pl, mentioned at the top of the file. I generated 92 input files (each 31 MB), including shuffling the data via shuffle.pl. Six input files will suffice; it is not necessary to create all 92.

gen-llil.pl URL

Scroll down the page a little. I used the updated gen-llil.pl script, found under “Updated Test File Generators”.

perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big2.txt 200 3 1
...
perl gen-llil.pl big92.txt 200 3 1

shuffle.pl URL

The source is at the top of the page (6 lines).

perl shuffle.pl big1.txt >tmp && mv tmp big1.txt
perl shuffle.pl big2.txt >tmp && mv tmp big2.txt
...
perl shuffle.pl big92.txt >tmp && mv tmp big92.txt

I found two slowness issues, depending on whether MAX_STR_LEN_L is defined (line 97).

  1. MAX_STR_LEN_L defined: the slowness is caused by str.append(s.c_str()) in “out_properties”, line 281.

  2. With line 97 commented out: the other slowness is in “get_properties”, str_type s(beg_ptr, klen), line 228.

Those are the two noticeable slowdowns comparing nvc++ vs clang++.

A better example is llil4map.cc; please use this one. The parallel-hashmap usage in the prior example is not a typical pattern, just something I tried out of curiosity.

Here are the two slowness issues compared to clang++. Testing: 92 input files on an AMD Threadripper 3970X box.

$ ./llil4map /data1/input/big* | cksum

  1. MAX_STR_LEN_L defined (line 96): like before, “out_properties” is noticeably slower than clang++ due to str.append(s.c_str()), line 281.
      clang++ write stdout:  0.704 secs.
        nvc++ write stdout:  4.211 secs.
  2. Next, I commented out line 96, the MAX_STR_LEN_L definition. There is slowness with str_type s(beg_ptr, klen) found in “get_properties”, line 228.
      clang++ get properties:  4.735 secs.
        nvc++ get properties:  5.418 secs.

I gave it a try, but for some reason I’m not able to reproduce the good times you’re getting from g++ and clang++. Irrespective of the “MAX_STR_LEN_L” setting, all compilers’ write times are around 5 seconds.

Which versions of g++ and clang++ are you using? I’m using 12.3 and 16.0.

Are there any environment variables you’re using?

In your case, could g++ and clang++ be buffering the output rather than flushing?