Hello there!

I am performing a basic Matrix Multiply using CUDA Fortran and C without any optimizations. Both Fortran and C are doing the exact same thing but the execution time for Fortran is slower.

This question has already been asked without any closure to it.

C Kernel

```
#define idx(x,y,z) x*y + z
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {
//Global inidices
int tx = blockIdx.y * blockDim.y + threadIdx.y;
int ty = blockIdx.x * blockDim.x + threadIdx.x;
int k;
if (tx < SIZE && ty < SIZE) {
double accum = 0.0;
//Accumulation for (tx,ty) position
for (k=0; k<SIZE; ++k) {
accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
}
d_matSym[idx(tx,SIZE,ty)] = accum;
}
}
//Call
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();
```

Fortran Kernel

```
attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
integer :: tx, ty, k
real*8 :: accum
real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y
if (tx <= SIZE_ .and. ty <=SIZE_) then
accum = 0.0
do k=1, SIZE_
accum = accum + d_mat(tx,k) * d_matT(k,ty)
end do
d_matSym(tx,ty) = accum
end if
end subroutine matrixMultiply
!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()
```

The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.

If it is the memory access then in both cases, the K loop accesses the one matrix contiguously and another access jumps by SIZE.

Both produce the same results.

For 16384 x 16384 Matrix, C : 5.4 sec Fortran : 6.3 sec

If both codes have the same operations and use the same data, then the gap between executions should not be that large.

If a different assembly is generated for both codes. How to save that PTX file for nvfortran or pgfortran commands. Both of them do not have an option to generate that. Or is it that Fortran generates a normal .asm code?