Dear all:

I modified the example code of matrix multiplication in the file “CUDA Fortran Programming Guide and Reference” as below because I tried to apply for the arbitrary dimensions.

It’s seemed something wrong when I tested with two matrices Adev(568,568) and Bdev(568, 2902).

There were always errors larger than 1.E-3…:(

The dimensions of grid and block were

```
dimGrid = dim3( (568-1)/16+1, (2902-1)/16+1, 1 )
dimBlock = dim3( 16, 16, 1 )
```

How should I modify my code?

Thank you in advance.

Feng

```
attributes(global) subroutine gpu_cal_coef( Adev, Bdev, Cdev, NB, M, L)
implicit none
integer, value :: NB, M, L
real*8, device :: Adev(NB,M), Bdev(M,L), Cdev(NB,L)
integer, device :: i, j, kb, k, tx, ty
real*8, shared :: Asub(16,16), Bsub(16,16)
real*8, device :: Cij
! Start execution, first get my thread indices
tx = threadidx%x
ty = threadidx%y
! This thread computes C(i,j) = sum(A(i,:) * B(:,j))
i = (blockidx%x-1)*16 + tx
j = (blockidx%y-1)*16 + ty
Cij = 0.d0
do kb = 1, M, 16
if (i<=NB .and. kb+ty-1<=M)then !<--modification
Asub(tx,ty) = Adev(i,kb+ty-1)
else
Asub(tx,ty) = 0.d0 !<--modification
end if
if (kb+tx-1<=M .and. j<=L)then !<--modification
Bsub(tx,ty) = Bdev(kb+tx-1,j)
else
Bsub(tx,ty) = 0.d0 !<--modification
end if
call syncthreads()
do k = 1,16
Cij = Cij + Asub(tx,k)*Bsub(k,ty)
enddo
call syncthreads()
enddo
Cdev(i,j) = Cij
end subroutine gpu_cal_coef
```