Nvfortran is roughly 10 times slower than Intel for OpenMP matrix multiplication code

Hello,
this is regarding CPU performance (not GPU).
I can’t figure out why the following OpenMP Fortran code for matrix multiplication runs roughly 10 times slower when compiled with nvfortran or gfortran compared to Intel. The compiler.sh script and the Fortran code are attached.

module load intel nvhpc/21.9 gcc/11.1.0
[ilkhom@t019 ORIG]$ bash ./compiler.sh intel
The code is compiled with Intel
Number of threads |   Time (sec)
    1             |      2.33
    2             |      1.23
    3             |      0.85
    4             |      0.67
    5             |      0.56
    6             |      0.46
    7             |      0.44
    8             |      0.38
    9             |      0.37
   10             |      0.32
   11             |      0.32
   12             |      0.28
   13             |      0.27
   14             |      0.26
   15             |      0.24
   16             |      0.23
[ilkhom@t019 ORIG]$ bash ./compiler.sh gfortran
The code is compiled with gfortran
Number of threads |   Time (sec)
    1             |     31.60
    2             |     15.89
    3             |     11.33
    4             |      8.04
    5             |      7.01
    6             |      5.65
    7             |      4.97
    8             |      4.33
    9             |      4.15
   10             |      3.74
   11             |      3.54
   12             |      3.11
   13             |      2.96
   14             |      2.67
   15             |      2.59
   16             |      2.35
[ilkhom@t019 ORIG]$ bash ./compiler.sh nvfortran
The code is compiled with nvfortran
Number of threads |   Time (sec)
    1             |     31.47
    2             |     15.75
    3             |     10.73
    4             |      8.03
    5             |      6.87
    6             |      5.67
    7             |      4.99
    8             |      4.33
    9             |      4.19
   10             |      3.71
   11             |      3.40
   12             |      3.10
   13             |      3.03
   14             |      2.66
   15             |      2.75
   16             |      2.34

Kind regards,
Ilkhom
compiler.sh (561 Bytes)

Hi Ilkhom,

I can take a look, but can you please provide the “mm.f90” source file you’re using?

I don’t see any binding settings in your script. How does setting the environment variable “OMP_PROC_BIND=true” affect performance?

What CPU architecture are you using? How many sockets/cores does the system have?

-Mat

mm.f90 (1007 Bytes)
Hi Mat,
For some reason the system did not allow me to upload two files when I created the post, so I created it anyway, intending to upload the second file at a later stage.
I tried to set “OMP_PROC_BIND=true” but it did not improve the performance.
Thanks,
Ilkhom

We have the following CPU architecture:

[ilkhom@t020 test]$ lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
Stepping:              7
CPU MHz:               999.908
CPU max MHz:           3500.0000
CPU min MHz:           1000.0000
BogoMIPS:              5000.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              11264K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15

Yes, it looks like they’re doing a better job of optimizing this. Serially, if you use the MATMUL intrinsic, which is highly optimized, you’d get about the same performance. Of course, that doesn’t help with OpenMP.
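For example, a minimal serial sketch using MATMUL (same N = 3200 and initialization as the mmA1.f90 listing below; the timing scaffolding is just for comparison) would be:

program matmul_serial
    implicit none
    integer, parameter :: N = 3200
    integer :: i, j
    real :: A(N,N), B(N,N), C(N,N)
    integer :: time_1, time_2, countrate
    real*8 :: secs

    ! Same initialization as mm.f90
    do i = 1, N
        do j = 1, N
            A(i,j) = i + j
            B(i,j) = i - j
        end do
    end do

    call system_clock(count_rate=countrate)
    call system_clock(time_1)
    C = matmul(A, B)       ! optimized intrinsic replaces the triple loop
    call system_clock(time_2)
    secs = real(time_2 - time_1)/real(countrate)
    print *, C(1,1), secs  ! touch C so the multiply isn't optimized away
end program matmul_serial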

I can do a few things in your code to help get us a lot closer, and disabling FMA (i.e. -Mnofma) seems to help here. The two changes were to hoist the initialization of “C” out of the timed region and to interchange the loops for better memory access.

Note that I added a dummy call to “foo(C)” since, without OpenMP, dead code elimination can remove the entire loop given the results of “C” aren’t used (a minimal stub for “foo” is sketched after the listing).

$ cat mmA1.f90
program matrix_multiply
    use omp_lib
    implicit none
    !integer, parameter :: N = 3000
    integer, parameter :: N = 3200
    integer :: i, j, k
    real :: start_time, end_time, ctmp
    real :: A(N,N), B(N,N), C(N,N)
    integer :: time_1, time_2, delta_t, countrate, countmax
    real*8 :: secs

    ! Initialize matrices A and B
    do i = 1, N
        do j = 1, N
            A(i,j) = i + j
            B(i,j) = i - j
        end do
    end do
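    ! Hoisted: C is initialized once here, outside the timed region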
    C = 0.0

    call system_clock(count_max=countmax, count_rate=countrate)
    call system_clock(time_1)

    ! Compute matrix multiplication
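    ! Loop order interchanged to (j, k, i) so the innermost index is the
    ! stride-1, column-major index of both C and A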
    !$omp parallel do shared(A,B,C) private(i,j,k)
       do j = 1, N
          do k = 1, N
            do i = 1, N
                C(i,j) = C(i,j) + A(i,k) * B(k,j)
            end do
        end do
    end do
    !$omp end parallel do
    call system_clock(time_2)
    delta_t = time_2-time_1
    secs = real(delta_t)/real(countrate)
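    ! Dummy call keeps C live so the multiply loop isn't removed as dead code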
    call foo(C)
!    print *, C(:10,1)
    print *, secs
!    write(6,'(1I5,13X,1A,F10.2)')OMP_GET_MAX_THREADS(),'|',secs

end program matrix_multiply
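
The actual “foo” isn’t shown here; any external routine that references “C” will do. A minimal assumed stub so the example links could be:

subroutine foo(C)
    ! Minimal assumed stub: its only job is to reference C so the timed
    ! loop can't be removed as dead code. Assumed-size so it matches the
    ! implicit interface of the call in the main program.
    implicit none
    real, intent(in) :: C(*)
    print *, C(1)
end subroutine foo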