Numerical error in matmul results between using fixed-size and dynamic arrays

Hi all,

I encounter a weird numerical behaviour of dynamic arrays and matrix multiplication with pgi fortran. In the sample code below, I am passing dynamic arrays and fixed-size arrays into subroutines and do the matrix multiplication inside of the subroutines, respectively, and then I check if the two give the same result. Originally I thought they should be the same as the only difference between the two here is the way of defining the array.

module mathOps
    subroutine matmul_fix(x,y,result,N,M)
        real,dimension(N,M) :: x, y
        real,dimension(N,M) :: result

        result = matmul(x, y)
    end subroutine matmul_fix

    subroutine matmul_dynamic(x,y,result)
        real,dimension(:,:) :: x, y
        real,dimension(:,:) :: result

        result = matmul(x, y)
    end subroutine matmul_dynamic
end module mathOps

program testMatMul
    use mathOps
    implicit none
    integer, parameter :: N = 256
    integer, parameter :: M = 256
    real :: x_mat(N, M), y_mat(N, M),z_mat(N, M),z_mat2(N, M)

    call random_number(x_mat)
    call random_number(y_mat)

    call matmul_dynamic(x_mat, y_mat, z_mat)
    call matmul_fix(x_mat, y_mat, z_mat2, N, M)

    write(*,*) 'Error between fixed and dynamic ',sum(abs(z_mat-z_mat2))
end program testMatMul

I compile the above code as follow:

pgf90 -o test test.f90

and the output is

Error between fixed and dynamic    0.7801056

Surprisingly, I find that there is a significant difference between the two! I also notice that such error is reduced if I add a -r8 compiler flag to enforce double precision.

pgf90 -o test test.f90 -r8
Error between fixed and dynamic    1.6325714113918366E-009

But such difference is still quite unsettling as I also try to compile with Intel and GNU Fortran compiler, and their results are both machine precision 0:

ifort -o test test.f90
Error between fixed and dynamic    0.0000000E+00

gfortran -o test test.f90
Error between fixed and dynamic    0.0000000E+00

So it seems that there is a compiler-related issue. Could anyone enlighten me what is the cause of such discrepancy? Here is the version information of the pgi compiler I am using:

pgf90 19.4-0 LLVM 64-bit target on x86-64 Linux -tp skylake 
PGI Compilers and Tools
Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.



Yes, it is unfortunate but it looks like we choose a different matmul algorithm when the bounds are compile-time constants vs. when they are passed in at runtime. Both answers are correct within round-off error. You can see this in a couple of ways. To ensure exact arithmetic for a problem this size, do something like this:

call random_number(x_mat)
call random_number(y_mat)
x_mat = real(int(x_mat * 10.0))
y_mat = real(int(y_mat * 10.0))

Also, if you look at any of the maximum differences:
write(,) 'Max diff between fixed and dynamic ',maxval(abs(z_mat-z_mat2))

You’ll see that the max difference for any point is around 7.e-5, which for a 256 element dot product is reasonable, and when you add 65536 of those together, you get to the value you print out.

Hi brentl,

Thanks a lot for the explanation. Indeed the max error is a better metric than the total error sum, and from that the difference is not that bad. But I have to say 7e-5 is still a little bit higher than my comfortable expectation of round-off error for single precision float… Nevertheless my intention is to raise the awareness of this particular behaviour of pgfortran to the community since matmul is such a fundamental operation and I did not find other related discussion about it online.