Is there any function in cublas limit the matrix size less than or equal to 48

I’m working on a program to do matrix Block Householder QR decomposition with cublas. The algorithm is showed in this picture:

!************* code... set dQ = I ***************

    if(blockSize <= 1) blockSize = m
    do k=1,m,blockSize
        realSize=min(m-k+1, blockSize)
        ldm = m-k+1
        do j=k,k+realSize-1
            ldl = m-j+1; ljk = j-k+1
            call house(ldm,ljk,dA(k:m,j),dv,beta)
            B(ljk) = beta
            call cublasSgemm('n','t',ldl,ldl,1,1.0,dv(ljk),ldm,dv(ljk),ldm,0.0,vvt,ldl)
            call cublasSgemm('n','n',ldl,k+realSize-j,ldl,-1.0*beta,vvt,ldl,dA(j,j),lda,1.0,dA(j,j),lda)
            !$cuf kernel do <<<*,*>>>
            do i=1,ldm
                dY(i,ljk) = dv(i); dW(i,ljk) = -1.0*beta*dv(i)
                if(i+k-1 > j) dA(i+k-1,j) = dv(i)
            end do
        end do
        deallocate(dv); allocate(wy(ldm,ldm))
        do i=2,realSize
            call cublasSgemm('n','t',ldm,ldm,i-1,1.0,dW(1,1),ldm,dY(1,1),ldm,0.0,wy,ldm)
            call cublasSgemm('n','n',ldm,1,ldm,-1.0*B(i),wy,ldm,dY(1,i),ldm,1.0,dW(1,i),ldm)
        end do
        if(k+realSize .LE. m) then
            call cublasSgemm('n','t',ldm,ldm,realSize,1.0,dY,ldm,dW,ldm,0.0,wy,ldm)
            call cublasSgemm('n','n',ldm,m-k-realSize+1,ldm,1.0,wy,ldm,dA(k,k+realSize),lda,1.0,dA(k,k+realSize),lda)
        end if
        call cublasSgemm('n','t',ldm,ldm,realSize,1.0,dW,ldm,dY,ldm,0.0,wy,ldm)
        call cublasSgemm('n','n',m,ldm,ldm,1.0,dQ(1,k),m,wy,ldm,1.0,dQ(1,k),m)
    end do
    !**********check result: dQ = dQ * R **************
    call cublasStrmm('r','u','n','n',m,m,1.0,dA,m,dQ,lda)
    print *,"Max error:",maxval(abs(A - dQ))

Problem: the order of matrix A less than or equal to 48, the program could return right result.

Once the order bigger than 48, the result returned is wrong!

Error description:

the order of matrix A less than or equal to 48:

any value of blocksize between 1 to 48 could get right result

the order of matrix A bigger than 48:

if the blocksize less than 48, the first blocksize columns get the right result, the rest get wrong. if the blocksize equal to or bigger than 48, only the first 48 columns get right result, the rest get wrong.

It seems like somewhere in the code have a limit, limit the columns to 48 size

the cuBlas functions I used: cublasSgemm cublasStrmm cublasSnrm2 cublasSdot cublasSscal

do they have the limit ???

I have checked memory allocate / deallocate with stat info, no wrong message.

Please help me !!!