CUDA Matrix Multiply on Fortran is slower than C

Hello there!

I am performing a basic matrix multiply using CUDA Fortran and CUDA C, without any optimizations. Both versions do exactly the same work, but the Fortran version's execution time is consistently slower.

This question has been asked before without reaching any resolution.

C Kernel

//Parenthesized so the macro stays correct for compound arguments
#define idx(x,y,z) ((x)*(y) + (z))
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global indices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
//Call  
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();

Fortran Kernel

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

    integer :: tx, ty, k
    real*8 :: accum
    real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
    
    tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

    if (tx <= SIZE_ .and. ty <= SIZE_) then
        accum = 0.0d0
        do k=1, SIZE_
            accum = accum + d_mat(tx,k) * d_matT(k,ty)
        end do
        d_matSym(tx,ty) = accum
    end if
end subroutine matrixMultiply

!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

One difference is that C uses a 1D array whereas Fortran uses 2D arrays, but that should not be a problem, since underneath the memory is contiguous in both cases.

If it were the memory access pattern, then in both cases the k loop accesses one matrix contiguously while the access to the other matrix jumps by SIZE.

Both produce the same results.

For a 16384 x 16384 matrix: C: 5.4 s, Fortran: 6.3 s.
If both codes perform the same operations on the same data, the gap between the execution times should not be that large.

Could different assembly be generated for the two codes? If so, how can I save the PTX file with the nvfortran or pgfortran commands? Neither seems to have an option to generate it. Or does Fortran emit ordinary host assembly instead?

Try declaring the arrays with a fixed size rather than using assumed-shape arrays.

real*8, dimension(SIZE_,SIZE_) :: d_mat, d_matT, d_matSym

Otherwise the compiler has to add code to fetch the sizes and compute the offsets of the arrays at run time.
