A cuBLAS problem

Hi, I have a small problem with cuBLAS SGEMM.

I would like to write a loop like this:

      real (fp_kind), dimension(:,:), allocatable :: A, B, C
      real :: time_start, time_end
      real (fp_kind) :: alpha=1._fp_kind, beta=1._fp_kind, c_right
      integer :: i, j, m1, m2
      integer :: stat, stat2, stat3
      integer :: size_of_real=16
      integer*8 :: devPtrA, devPtrB, devPtrC

c CUBLAS
      external cublas_init, cublas_set_matrix, cublas_get_matrix
      external cublas_shutdown, cublas_alloc
      integer cublas_alloc

C CUBLAS
      call cublas_init()

      do m1=128,2560,32
        print *,m1

        allocate(A(m1,m1))
        allocate(B(m1,m1))
        allocate(C(m1,m1))

        stat=cublas_Alloc(m1*m1,size_of_real, devPtrA)
        stat2=cublas_Alloc(m1*m1,size_of_real, devPtrB)
        stat3=cublas_Alloc(m1*m1,size_of_real, devPtrC)

! Initialize the matrices A,B and C
        A=1._fp_kind
        B=2._fp_kind
        C=3._fp_kind

        call cublas_Set_Matrix(m1,m1,size_of_real,A,m1,
     .           devPtrA,m1)
        call cublas_Set_Matrix(m1,m1,size_of_real,B,m1,
     .           devPtrB,m1)
        call cublas_Set_Matrix(m1,m1,size_of_real,C,m1,
     .           devPtrC,m1)

        call cublas_SGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,
     .              devPtrB,m1,beta,devPtrC,m1)

        call cublas_Free(devPtrA)
        call cublas_Free(devPtrB)
        call cublas_Free(devPtrC)

        deallocate(A,B,C)
      end do

      call cublas_shutdown()

But when I run it, only one iteration executes. Why?

Thanks for the help!

For one thing, your size_of_real is wrong.
A single-precision floating-point value is 4 bytes; you are using 16.
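
Rather than hard-coding the element size, you could derive it from the kind you are using. The following is only a minimal sketch: it assumes fp_kind is the default real kind (the original post does not show how fp_kind is defined) and a compiler that supports the Fortran 2008 storage_size intrinsic.

```fortran
      program elem_size
      implicit none
      ! Assumption: fp_kind is single precision, as SGEMM expects.
      ! The original post does not show its actual definition.
      integer, parameter :: fp_kind = kind(0.0)
      real (fp_kind) :: x
      integer :: size_of_real

      ! storage_size returns the size in bits, so divide by 8
      size_of_real = storage_size(x)/8
      print *, 'bytes per element:', size_of_real
      end program elem_size
```

The resulting size_of_real can then be passed to cublas_Alloc and cublas_Set_Matrix in place of the literal 16, and it stays correct if fp_kind is later switched to double precision.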

Thank you.
Now the loop runs to the end, but when I measure the CPU time the values are 0.
So I suppose that dgemm is not being called correctly.
Right?

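You never copy back the results. The GEMM call is launched asynchronously and returns to the host before the multiplication has finished, so unless you add a cudaDeviceSynchronize or copy back C, your timing will be incorrect.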
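
A rough sketch of what the timing could look like with the copy-back included is shown here. It assumes the same legacy cuBLAS Fortran wrappers used in the original post, a fixed matrix size m = 1024 chosen only for illustration, and wall-clock timing with system_clock instead of CPU time; none of these choices come from the original code.

```fortran
      program time_sgemm
      implicit none
      ! Assumptions: single precision (4 bytes per element) and an
      ! arbitrary size m; both are illustrative choices.
      integer, parameter :: fp_kind = kind(0.0)
      integer, parameter :: m = 1024, size_of_real = 4
      real (fp_kind), dimension(:,:), allocatable :: A, B, C
      real (fp_kind) :: alpha = 1._fp_kind, beta = 1._fp_kind
      integer*8 :: devPtrA, devPtrB, devPtrC
      integer*8 :: t0, t1, rate
      integer :: stat
      integer :: cublas_alloc
      external cublas_init, cublas_shutdown, cublas_alloc
      external cublas_set_matrix, cublas_get_matrix
      external cublas_sgemm, cublas_free

      call cublas_init()
      allocate(A(m,m), B(m,m), C(m,m))
      A = 1._fp_kind
      B = 2._fp_kind
      C = 3._fp_kind

      stat = cublas_alloc(m*m, size_of_real, devPtrA)
      stat = cublas_alloc(m*m, size_of_real, devPtrB)
      stat = cublas_alloc(m*m, size_of_real, devPtrC)
      call cublas_set_matrix(m, m, size_of_real, A, m, devPtrA, m)
      call cublas_set_matrix(m, m, size_of_real, B, m, devPtrB, m)
      call cublas_set_matrix(m, m, size_of_real, C, m, devPtrC, m)

      call system_clock(t0, rate)
      call cublas_sgemm('n', 'n', m, m, m, alpha, devPtrA, m,
     .                  devPtrB, m, beta, devPtrC, m)
      ! The copy back waits for the GEMM to finish, so the measured
      ! interval covers the multiplication plus the transfer of C.
      call cublas_get_matrix(m, m, size_of_real, devPtrC, m, C, m)
      call system_clock(t1)

      print *, 'elapsed seconds:', dble(t1 - t0)/dble(rate)
      ! With A=1, B=2, C=3 every entry should be 2*m + 3
      print *, 'C(1,1) =', C(1,1), ' expected', 2*m + 3

      call cublas_free(devPtrA)
      call cublas_free(devPtrB)
      call cublas_free(devPtrC)
      deallocate(A, B, C)
      call cublas_shutdown()
      end program time_sgemm
```

Because the measured interval includes the transfer of C back to the host, it somewhat overstates the cost of the multiplication alone; checking one entry of C against 2*m + 3 is a cheap way to confirm that the GEMM actually ran.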

Thank you. I copy back C and now I get what I want.

I have another question. I would like to compare BLAS dgemm and cuBLAS dgemm. In my test, cuBLAS dgemm is faster than MKL dgemm.

But in my application, MKL dgemm is faster than cuBLAS dgemm.

So I would like to know if there is somewhere I can read more about this.