Nvfortran is roughly 10 times slower than Intel for OpenMP matrix multiplication code

Hello,
this is regarding CPU performance (not GPU).
I can’t figure out why the following OpenMP Fortran code for matrix multiplication runs roughly 10 times slower when compiled with nvfortran or gfortran compared to Intel. The compiler.sh script and the Fortran code are attached.

module load intel nvhpc/21.9 gcc/11.1.0
[ilkhom@t019 ORIG]$ bash ./compiler.sh intel
The code is compiled with Intel
Number of threads |   Time (sec)
    1             |      2.33
    2             |      1.23
    3             |      0.85
    4             |      0.67
    5             |      0.56
    6             |      0.46
    7             |      0.44
    8             |      0.38
    9             |      0.37
   10             |      0.32
   11             |      0.32
   12             |      0.28
   13             |      0.27
   14             |      0.26
   15             |      0.24
   16             |      0.23
[ilkhom@t019 ORIG]$ bash ./compiler.sh gfortran
The code is compiled with gfortran
Number of threads |   Time (sec)
    1             |     31.60
    2             |     15.89
    3             |     11.33
    4             |      8.04
    5             |      7.01
    6             |      5.65
    7             |      4.97
    8             |      4.33
    9             |      4.15
   10             |      3.74
   11             |      3.54
   12             |      3.11
   13             |      2.96
   14             |      2.67
   15             |      2.59
   16             |      2.35
[ilkhom@t019 ORIG]$ bash ./compiler.sh nvfortran
The code is compiled with nvfortran
Number of threads |   Time (sec)
    1             |     31.47
    2             |     15.75
    3             |     10.73
    4             |      8.03
    5             |      6.87
    6             |      5.67
    7             |      4.99
    8             |      4.33
    9             |      4.19
   10             |      3.71
   11             |      3.40
   12             |      3.10
   13             |      3.03
   14             |      2.66
   15             |      2.75
   16             |      2.34

Kind regards,
Ilkhom
compiler.sh (561 Bytes)

Hi Ilkhom,

I can take a look, but can you please provide the “mm.f90” source file you’re using?

I don’t see any binding settings in your script. How does setting the environment variable “OMP_PROC_BIND=true” affect performance?

What CPU architecture are you using? How many sockets/cores does the system have?

-Mat

mm.f90 (1007 Bytes)
Hi Mat,
For some reason the system did not allow me to upload two files when I created the post, so I created it anyway, intending to upload the second file at a later stage.
I tried to set “OMP_PROC_BIND=true” but it did not improve the performance.
Thanks,
Ilkhom

We have the following CPU architecture:

[ilkhom@t020 test]$ lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
Stepping:              7
CPU MHz:               999.908
CPU max MHz:           3500.0000
CPU min MHz:           1000.0000
BogoMIPS:              5000.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              11264K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

Yes, it looks like they’re doing a better job of optimizing this. Serially, if you use the MATMUL intrinsic, which is highly optimized, you’d get about the same performance. Of course, that doesn’t help with OpenMP.
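For example, a minimal serial sketch using MATMUL (same N = 3200 and initialization as the mmA1.f90 listing below; the timing scaffolding is just for comparison) would be:

program matmul_serial
    implicit none
    integer, parameter :: N = 3200
    integer :: i, j
    real :: A(N,N), B(N,N), C(N,N)
    integer :: time_1, time_2, countrate
    real*8 :: secs

    ! Same initialization as mm.f90
    do i = 1, N
        do j = 1, N
            A(i,j) = i + j
            B(i,j) = i - j
        end do
    end do

    call system_clock(count_rate=countrate)
    call system_clock(time_1)
    C = matmul(A, B)       ! optimized intrinsic replaces the triple loop
    call system_clock(time_2)
    secs = real(time_2 - time_1)/real(countrate)
    print *, C(1,1), secs  ! touch C so the multiply isn't optimized away
end program matmul_serial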

I can do a few things in your code to help get us a lot closer, and disabling FMA (i.e. -Mnofma) seems to help here. The two changes were to hoist the initialization of “C” out of the timed region and to interchange the loops for better memory access.

Note that I added a dummy call to “foo(C)” since, without OpenMP, dead code elimination can remove the entire loop given the results of “C” aren’t used (a minimal stub for “foo” is sketched after the listing).

$ cat mmA1.f90
program matrix_multiply
    use omp_lib
    implicit none
    !integer, parameter :: N = 3000
    integer, parameter :: N = 3200
    integer :: i, j, k
    real :: start_time, end_time, ctmp
    real :: A(N,N), B(N,N), C(N,N)
    integer :: time_1, time_2, delta_t, countrate, countmax
    real*8 :: secs

    ! Initialize matrices A and B
    do i = 1, N
        do j = 1, N
            A(i,j) = i + j
            B(i,j) = i - j
        end do
    end do
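    ! Hoisted: C is initialized once here, outside the timed region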
    C = 0.0

    call system_clock(count_max=countmax, count_rate=countrate)
    call system_clock(time_1)

    ! Compute matrix multiplication
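    ! Loop order interchanged to (j, k, i) so the innermost index is the
    ! stride-1, column-major index of both C and A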
    !$omp parallel do shared(A,B,C) private(i,j,k)
       do j = 1, N
          do k = 1, N
            do i = 1, N
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
            end do
        end do
    end do
    !$omp end parallel do
    call system_clock(time_2)
    delta_t = time_2-time_1
    secs = real(delta_t)/real(countrate)
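    ! Dummy call keeps C live so the multiply loop isn't removed as dead code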
    call foo(C)
!    print *, C(:10,1)
    print *, secs
!    write(6,'(1I5,13X,1A,F10.2)')OMP_GET_MAX_THREADS(),'|',secs

end program matrix_multiply
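
The actual “foo” isn’t shown here; any external routine that references “C” will do. A minimal assumed stub so the example links could be:

subroutine foo(C)
    ! Minimal assumed stub: its only job is to reference C so the timed
    ! loop can't be removed as dead code. Assumed-size so it matches the
    ! implicit interface of the call in the main program.
    implicit none
    real, intent(in) :: C(*)
    print *, C(1)
end subroutine foo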