Do cuSOLVER and cuBLAS transfer memory between host and device internally?


I am using cuBLAS and cuSOLVER to solve the linear system Ax = y. I wrote a function named “qr” to compute the solution, and I need to call it 7802*1000 times in my full computation.
Here is my function:

void qr(float *A_d, float *y_d, float *res_d, float *x_d, float *R_d, float *A1_d, float *y1_d, float *Q_d, float *TAU_d, int *devInfo, float *work, int work_size, int m, int n, int Iter, cusolverDnHandle_t solver_handle, cublasHandle_t cublas_handle) {

    //*** cuSOLVER input/output parameters ***//

    //*** Copying A and y between device buffers (copy2 kernel) ***//
    // f2 starts at 1 so that m/f2 cannot divide by zero when m*n <= 128.
    int f2 = 1;
    if (m*n > 128) { f2 = m*n/128; }
    dim3 dimBlock_A(m/f2 + 1, n);
    dim3 dimGrid_A(f2 + 1, 1);

    copy2<<<dimGrid_A, dimBlock_A>>>(A1_d, A_d, y1_d, y_d, m, n);

    //*** CUDA QR initialization ***//
    //*** CUDA GEQRF execution ***//

    cusolverDnSgeqrf(solver_handle, m, n, A_d, m, TAU_d, work, work_size, devInfo);

    // At this point, the upper triangular part of A contains the elements of R.

    //*** CUDA QR execution to find Q ***//

    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_N, m, n, n, A_d, m, TAU_d, Q_d, m, work, work_size, devInfo);

    // At this point, Q_d contains the elements of Q
    // (ormqr overwrites Q_d with Q * Q_d, so Q_d must hold the identity on entry).

    //*** CUDA QR execution to find Q' * y ***//

    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, n, n, A_d, m, TAU_d, y_d, m, work, work_size, devInfo);

    // At this point, y_d contains the elements of Q^T * y, where y is the data vector.

    // --- Reducing the linear system size

    dim3 Grid(1, 1);
    dim3 Block(n, n);

    update<<<Grid, Block>>>(A_d, R_d, y_d, x_d, m, n);

    // --- Solving an upper triangular linear system

    const float alpha = 1.f;
    cublasStrsm(cublas_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha, R_d, n, x_d, n);

    prod<<<1, m>>>(A1_d, x_d, res_d, y1_d, m, n);
}
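
A side note on the launch configuration used for copy2: an alternative to deriving the block and grid shape from f2 is a one-dimensional launch with a fixed block size and ceiling division for the grid. The sketch below is only an illustration of that pattern. The kernel copy_elems and the helpers ceil_div and launch_copy are hypothetical names, not the copy2 kernel from the post (whose body is not shown), and the sketch assumes m*n > 0.

```cuda
#include <cuda_runtime.h>

// Ceiling division: number of blocks of size `block` needed to cover `total` elements.
inline int ceil_div(int total, int block) {
    return (total + block - 1) / block;
}

// Hypothetical stand-in for a copy kernel: one element per thread.
__global__ void copy_elems(float *dst, const float *src, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {          // guard: the last block may overshoot the array
        dst[i] = src[i];
    }
}

// Covers all m*n elements with fixed 128-thread blocks (assumes m*n > 0).
inline void launch_copy(float *dst, const float *src, int m, int n) {
    const int threads = 128;
    copy_elems<<<ceil_div(m * n, threads), threads>>>(dst, src, m * n);
}
```

With this shape the block size never depends on m or n, so it cannot exceed the per-block thread limit, and the bounds check in the kernel handles the partially filled last block.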



I need to optimize this code, so I used nvprof to find out which part takes the most time. I ran this function 7802 times. In this process I do not copy any memory from device to host except to transfer the final result, so I expected exactly 7802 device-to-host memcpy calls.
But nvprof reports 46812 device-to-host transfers.
How is this possible? Is it because of the cuSOLVER and cuBLAS functions I am using in my code?
Here is the output of my nvprof command:

Yes, it's certainly possible that library functions are transferring data from device to host. With a little more effort, by creating a test code and running it under nvprof, you can confirm for yourself whether this is the case.
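
As a starting point for such a test, note that 46812 / 7802 = 6, so the profile shows six device-to-host copies per call, only one of which is the final result copy. A minimal sketch of a probe is given below. It runs a single cusolverDnSgeqrf on data generated on the device, so the code itself contains no cudaMemcpy at all. The names fill and geqrf_probe are made up for this example, and it assumes m >= n.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Fill a column-major m x n matrix on the device with a simple full-rank pattern.
__global__ void fill(float *A, int m, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m * n) {
        int row = i % m, col = i / m;
        A[i] = (row == col) ? 2.0f : 1.0f / (1.0f + row + col);
    }
}

// Runs one geqrf without any cudaMemcpy in user code. Returns 0 on success.
// Assumption for this sketch: m >= n, so tau has n entries.
int geqrf_probe(int m, int n) {
    cusolverDnHandle_t handle;
    if (cusolverDnCreate(&handle) != CUSOLVER_STATUS_SUCCESS) return 1;

    float *A_d = nullptr, *tau_d = nullptr, *work_d = nullptr;
    int *info_d = nullptr;
    int lwork = 0;
    cudaMalloc(&A_d, sizeof(float) * m * n);
    cudaMalloc(&tau_d, sizeof(float) * n);
    cudaMalloc(&info_d, sizeof(int));
    fill<<<(m * n + 127) / 128, 128>>>(A_d, m, n);

    // Query and allocate the workspace, then factorize in place.
    cusolverDnSgeqrf_bufferSize(handle, m, n, A_d, m, &lwork);
    cudaMalloc(&work_d, sizeof(float) * lwork);
    cusolverStatus_t st =
        cusolverDnSgeqrf(handle, m, n, A_d, m, tau_d, work_d, lwork, info_d);
    cudaDeviceSynchronize();

    cudaFree(work_d);
    cudaFree(info_d);
    cudaFree(tau_d);
    cudaFree(A_d);
    cusolverDnDestroy(handle);
    return (st == CUSOLVER_STATUS_SUCCESS) ? 0 : 2;
}
```

Call geqrf_probe from a small main and run the binary with nvprof --print-gpu-trace. Any [CUDA memcpy DtoH] rows that follow the fill kernel in the trace must come from the library, since the test code issues none. Handle creation may add transfers of its own, so it helps to look only at what happens after fill. The same probe can be repeated for cusolverDnSormqr and cublasStrsm to see how the six copies per call are distributed.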

If that is the case, what options do I have to optimise my code? I need to call the above function 7802*1000 times, so nvprof reports a very large number of memory transfers.
As you know, cudaMemcpy is slow, and heavy use of it is making my code very slow. What are my options? How can I optimise my code? Is there any way I can find the source code of the cuSOLVER and cuBLAS functions?
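
One option worth testing, assuming the 7802 systems within one iteration are independent of each other and each has a single right-hand side (neither is stated in the post), is to solve them all in one batched call instead of 7802 separate qr calls. cuBLAS provides cublasSgelsBatched, which computes QR-based least-squares solutions for a batch of small matrices. As far as I know, cuBLAS and cuSOLVER ship as closed-source binaries, so reading their source is not an option. The sketch below is not a drop-in replacement for the qr function. The wrapper solve_batched is a hypothetical name, and it takes host vectors only to keep the example self-contained. In the real code the data would already live on the device and the two bulk copies would disappear.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Solves `batch` independent least-squares problems min ||A_i x_i - y_i||.
// A: batch matrices, each m x n, column-major, stored back to back.
// C: batch right-hand sides of length m; on return the first n entries
//    of each right-hand side hold x_i.
// Returns 0 on success. Assumes m >= n and one right-hand side per system.
int solve_batched(std::vector<float> &A, std::vector<float> &C,
                  int m, int n, int batch) {
    const int nrhs = 1;
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    float *dA = nullptr, *dC = nullptr;
    float **dAptr = nullptr, **dCptr = nullptr;
    int *dInfo = nullptr;
    cudaMalloc(&dA, sizeof(float) * A.size());
    cudaMalloc(&dC, sizeof(float) * C.size());
    cudaMalloc(&dAptr, sizeof(float *) * batch);
    cudaMalloc(&dCptr, sizeof(float *) * batch);
    cudaMalloc(&dInfo, sizeof(int) * batch);

    // One upload for the whole batch.
    cudaMemcpy(dA, A.data(), sizeof(float) * A.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), sizeof(float) * C.size(), cudaMemcpyHostToDevice);

    // The batched API expects device arrays of device pointers.
    std::vector<float *> hA(batch), hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + static_cast<std::size_t>(i) * m * n;
        hC[i] = dC + static_cast<std::size_t>(i) * m * nrhs;
    }
    cudaMemcpy(dAptr, hA.data(), sizeof(float *) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dCptr, hC.data(), sizeof(float *) * batch, cudaMemcpyHostToDevice);

    // info is a host integer reporting invalid parameters; dInfo is per-system status.
    int info = 0;
    cublasStatus_t st = cublasSgelsBatched(handle, CUBLAS_OP_N, m, n, nrhs,
                                           dAptr, m, dCptr, m,
                                           &info, dInfo, batch);

    // One download for the whole batch.
    cudaMemcpy(C.data(), dC, sizeof(float) * C.size(), cudaMemcpyDeviceToHost);

    cudaFree(dInfo);
    cudaFree(dCptr);
    cudaFree(dAptr);
    cudaFree(dC);
    cudaFree(dA);
    cublasDestroy(handle);
    return (st == CUBLAS_STATUS_SUCCESS && info == 0) ? 0 : 2;
}
```

The point of the design is that the number of transfers no longer grows with the number of systems: the whole batch costs a fixed handful of copies rather than several per system. Whether this is actually faster for the matrix sizes in question is something only a profile can tell. The batched routines are aimed at small matrices, so for large m and n the per-call cuSOLVER path may still be the better choice.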