Hello!

I am using cuBLAS and cuSOLVER to solve the linear system Ax = y. I wrote a function named “qr” to compute the solution. I need to call this function 7802*1000 times in my full computation.

Here is my function:

```
void qr(float *A_d, float *y_d, float *res_d, float *x_d, float *R_d,
        float *A1_d, float *y1_d, float *Q_d, float *TAU_d, int *devInfo,
        float *work, int work_size, int m, int n, int Iter,
        cusolverDnHandle_t solver_handle, cublasHandle_t cublas_handle) {
    // --- Launch configuration for copy2
    int f2 = 0;
    if (m*n > 128) { f2 = m*n/128; }
    // Guard: for m*n <= 128, f2 stays 0 and m/f2 would divide by zero.
    int blockX = (f2 > 0) ? m/f2 + 1 : m;
    dim3 dimBlock_A(blockX, n);
    dim3 dimGrid_A(f2 + 1, 1);
    // copy2 is a device kernel (defined elsewhere) working on A_d/y_d and
    // A1_d/y1_d; no host transfer is issued here.
    copy2<<<dimGrid_A, dimBlock_A>>>(A1_d, A_d, y1_d, y_d, m, n);

    // --- QR factorization (GEQRF): A_d is overwritten in place
    cusolverDnSgeqrf(solver_handle, m, n, A_d, m, TAU_d, work, work_size, devInfo);
    // At this point, the upper triangular part of A_d contains R; the
    // Householder vectors are stored below the diagonal, with scalars in TAU_d.

    // --- Apply Q to Q_d (ORMQR computes Q * C and overwrites C)
    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_N, m, n, n, A_d, m, TAU_d, Q_d, m, work, work_size, devInfo);
    // Q_d now holds Q times its previous contents, i.e. Q itself only if
    // Q_d held the identity on entry.

    // --- Compute Q^T * y
    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, n, n, A_d, m, TAU_d, y_d, m, work, work_size, devInfo);
    // y_d now holds Q^T * y. Note that the sizes (m, n) describe y_d as an
    // m x n matrix; the column count would be 1 for a single right-hand side.

    // --- Reduce the linear system size (update is defined elsewhere)
    dim3 Grid(1, 1);
    dim3 Block(n, n);
    update<<<Grid, Block>>>(A_d, R_d, y_d, x_d, m, n);

    // --- Solve the upper triangular linear system
    const float alpha = 1.0f;
    cublasStrsm(cublas_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha, R_d, n, x_d, n);

    // prod is defined elsewhere; launched as one block of m threads
    prod<<<1, m>>>(A1_d, x_d, res_d, y1_d, m, n);
    return;
}
```

I need to optimize this code, so I used the “nvprof” command to find out which part takes the most time. I ran this function 7802 times. During this run I do not copy anything from device to host except the final result, so I expected exactly 7802 device-to-host memcpy calls.

But nvprof reports 46812 device-to-host transfers.

My question is: how is this possible? Is it because of the cuSOLVER and cuBLAS functions I am using in my code?

Here is the output of my nvprof run: https://drive.google.com/file/d/0BwKTZw-Pex8rbVVPWkdWdDIwOG8/view?usp=sharing