Using CUBLAS from gfortran... need help!

Hi everybody! I want to use CUBLAS from gfortran, but I have run into a problem :-(
Following Fatica's lecture (a PPT presentation available on the web), I tried the SGEMM example (THUNKING and NON_THUNKING) on a simple matrix multiplication. The THUNKING program works fine, but the NON_THUNKING one does not. It compiles without errors, but it does not give the correct result for C = A.B (I get C = 0 instead of 1 in my program). Could someone help me?

Here is the program:

program ex_sgemm
implicit none
integer :: n,i
real, allocatable :: A(:,:), B(:,:), C(:,:)
integer*8 :: devPtrA, devPtrB, devPtrC
integer :: size_of_real = 16
call cublas_init()
! allocate and initialize the CPU matrices
write(6,*) ' enter n '
read(5,*) n
allocate( A(n,n), B(n,n), C(n,n) )
! transfer the data to the GPU
call cublas_alloc(n*n, size_of_real, devPtrA)
call cublas_alloc(n*n, size_of_real, devPtrB)
call cublas_alloc(n*n, size_of_real, devPtrC)
A = 1.0
B = 1.0 / float(n)
C = 0.0
call cublas_Set_Matrix(n,n, size_of_real, A, n, devPtrA, n)
call cublas_Set_Matrix(n,n, size_of_real, B, n, devPtrB, n)
call cublas_Set_Matrix(n,n, size_of_real, C, n, devPtrC, n)
! call the CUBLAS library
call cublas_Get_Matrix(n,n, size_of_real, C, n, devPtrC, n)
write(6,*) 'C recuperee '
do i = 1, 10
write(6,10) C(i,1:10)
enddo

call CUBLAS_SGEMM( 'n', 'n', n, n, n, 1.0, devPtrA, n, devPtrB, n, 1.0, devPtrC, n )

! copy the result back, GPU -> CPU
call cublas_Get_Matrix(n,n, size_of_real, C, n, devPtrC, n)
write(6,*) 'C calculee '
do i = 1, 10
write(6,10) C(i,1:10)
enddo
deallocate( A, B, C)
call cublas_free(devPtrA)
call cublas_free(devPtrB)
call cublas_free(devPtrC)
10 format(10(2x,f10.5))
end program ex_sgemm

and the compile.sh:

nvcc -O3 -c /usr/local/cuda/src/fortran.c
gfortran -O3 *.o fortran_non_thunking.f90 -o toto_non -L/usr/local/cuda/lib64 -lcudart -lcublas
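
For comparison, here is a minimal CPU-only sketch of the result I expect from the SGEMM call, using the same initial values as the program above. It does not call CUBLAS at all. The program name ref_sgemm and the fixed size n = 4 are just illustrative choices, and the last line prints the storage size of a default real so it can be compared with the element size passed to the CUBLAS helper routines.

```fortran
! CPU-only reference for C = alpha*A*B + beta*C with alpha = beta = 1.0.
! No CUBLAS calls; n = 4 is an arbitrary illustrative size.
program ref_sgemm
  implicit none
  integer, parameter :: n = 4
  real :: A(n,n), B(n,n), C(n,n)

  ! Same initial values as in the GPU program
  A = 1.0
  B = 1.0 / real(n)
  C = 0.0

  ! Same operation as SGEMM('n', 'n', ...) with alpha = 1.0 and beta = 1.0
  C = 1.0 * matmul(A, B) + 1.0 * C

  ! Each entry is a sum of n terms 1.0 * (1.0/n), so it should be close to 1.0
  write(6,*) 'max |C - 1| = ', maxval(abs(C - 1.0))

  ! Bytes occupied by one default real (storage_size returns bits)
  write(6,*) 'bytes per real = ', storage_size(1.0) / 8
end program ref_sgemm
```

This can be compiled on its own with gfortran, without nvcc or the CUDA libraries.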