cublasDgeam subroutine in cuBLAS library produces wrong result

I have a row-major matrix of size M*N and want to apply the cuBLAS dgeam routine to it to obtain a column-major matrix. Yet when I print the transposed matrix, every element is 0. I could not find out what I am doing wrong or why every entry of the matrix is set to 0.

cublasHandle_t handle;
    cublasStatus_t status;

    status = cublasCreate(&handle);

    if (status != CUBLAS_STATUS_SUCCESS)
        printf("cublasCreate returned error code %d, line(%d)\n", status, __LINE__);
    /* Transpose of matrix V, because it is row major */

    int M = 8 ; 

    int N = 12;

    unsigned int size_V=M*N;
    unsigned int mem_size_V=sizeof(double)*size_V;

    double* h_V;
    h_V=(double*)malloc(mem_size_V);

    for(int i=0; i<M; i++){
      for(int j= 0; j<N; j++){
         //row major
         h_V[i*N+j] = i*j;
      }
    }

    double* d_V;
    CudaSafeCall(cudaMalloc((void**) &d_V, mem_size_V));

    const double alf = 1.0;
    const double bet = 0.0;
    const double *alpha = &alf;
    const double *beta = &bet;

    double* clone;
    clone=(double*)malloc(mem_size_V);
    double* clone_d ;
    CudaSafeCall(cudaMalloc((void**) &clone_d, mem_size_V));
    CudaSafeCall(cudaMemcpy(clone_d, clone, mem_size_V, cudaMemcpyHostToDevice));

    dim3 grid(1,1,1);
    dim3 block(16,1,1);

    gpuCopy<<<grid, block>>>(clone_d,d_V,M,N);

    CudaSafeCall(cudaMemcpy(clone, clone_d, mem_size_V, cudaMemcpyDeviceToHost));
     // copy matrix is correct
    for(int b=0; b<10; b++)
            std::cout << clone[b] << '\t' << std::endl;


    CudaSafeCall(cudaMemcpy(clone_d, clone, mem_size_V, cudaMemcpyHostToDevice));
    CublasSafeCall(cublasDgeam( handle, CUBLAS_OP_T, CUBLAS_OP_N, M, N, alpha, clone_d, N, beta, clone_d, M, d_V, M));

    CudaSafeCall(cudaMemcpy(h_V, d_V, mem_size_V, cudaMemcpyDeviceToHost));

    // each entry in transposed matrix is 0
    for(int b=0; b<10; b++)
         std::cout << h_V[b] << '\t' << std::endl;



Your code doesn’t make any sense. At this point in your code:

double* clone;
    clone=(double*)malloc(mem_size_V);  // you are allocating clone here
    double* clone_d ;                   // but you never store anything in it
    CudaSafeCall(cudaMalloc((void**) &clone_d, mem_size_V));
    CudaSafeCall(cudaMemcpy(clone_d, clone, mem_size_V, cudaMemcpyHostToDevice)); // then you copy it to device here

you have allocated space on the host for clone, and on the device for clone_d. But you have not initialized the contents of clone to anything. So that last cudaMemcpy line above makes no sense. clone contains garbage, and you are copying that garbage to the device.
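
If the host buffer needs defined contents before it is copied anywhere, one option is to let std::vector value-initialize it instead of calling malloc. This is only a minimal sketch, assuming a C++ host compiler; makeHostBuffer is a hypothetical helper name, not something from your code or from CUDA:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a host staging buffer with defined contents.
// Unlike malloc, constructing std::vector<double>(count) value-initializes
// every element, so each entry starts as 0.0 rather than garbage.
std::vector<double> makeHostBuffer(std::size_t count)
{
    assert(count > 0);
    return std::vector<double>(count);
}
```

The pointer returned by data() on such a buffer can be passed wherever a double* is expected, for example as the host argument of cudaMemcpy.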

Also, up until this point in the code, you have allocated and initialized h_V, but you have only allocated d_V. You have not copied or stored anything in d_V. Therefore, even though you haven’t shown the definition, this kernel call could not be doing anything useful:

gpuCopy<<<grid, block>>>(clone_d,d_V,M,N);

because both d_V and clone_d contain garbage at this point.

The rest of your code could not be doing anything useful either, since you are working with device arrays that contain garbage.
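
Once d_V actually holds the matrix, it can help to compare the cublasDgeam output against a host-side reference. The sketch below is only an illustration: rowToColMajor and makeRowMajorV are hypothetical helpers, not cuBLAS functions, and the fill V(i,j) = i*j is taken from the loop in your post. It moves the row-major element at index i*N+j to the column-major index j*M+i:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fills an M x N matrix in row-major order
// with V(i, j) = i * j, mirroring the initialization loop in the question.
std::vector<double> makeRowMajorV(std::size_t M, std::size_t N)
{
    std::vector<double> v(M * N);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            v[i * N + j] = static_cast<double>(i * j);
    return v;
}

// Hypothetical helper (not part of cuBLAS): converts an M x N matrix
// from row-major storage to column-major storage on the host.
std::vector<double> rowToColMajor(const std::vector<double>& rowMajor,
                                  std::size_t M, std::size_t N)
{
    assert(rowMajor.size() == M * N);
    std::vector<double> colMajor(M * N);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            colMajor[j * M + i] = rowMajor[i * N + j];  // (i, j) keeps its value
    return colMajor;
}
```

Copying d_V back to the host and comparing it element by element with rowToColMajor(makeRowMajorV(M, N), M, N) should show whether the device transpose produced what you expect.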

Thanks for your reply.

As you also mention, this code either produces garbage values or just sets every entry of the resulting matrix to 0.

I have noticed that I did not copy the contents of h_V to the device memory d_V, but I cannot understand what is wrong with the clone_d and clone variables. I just allocate some empty memory on the device and on the host. The kernel is meant to copy the matrix I want to transpose into clone_d, and I then copy it back to the host variable clone, so that I can pass it to cublasDgeam as an argument. It is like an uninitialized variable that receives the result after the kernel runs. Suppose d_V had been given its values with cudaMemcpy: would clone_d and clone still carry junk values?