Do cuSOLVER and cuBLAS perform multiple memory transfers between host and device?

Hello!

I am using cuBLAS and cuSOLVER to solve the linear system Ax = y. I wrote a function named "qr" that computes the solution via QR factorization. In my full computation I need to call this function 7802*1000 times.
Here is my function:

void qr(float *A_d, float *y_d, float *res_d, float *x_d, float *R_d, float *A1_d, float *y1_d, float *Q_d, float *TAU_d, int *devInfo, float *work, int work_size, int m, int n, int Iter, cusolverDnHandle_t solver_handle, cublasHandle_t cublas_handle) {

    //*** copy2: device-to-device copy between the A_d/A1_d and y_d/y1_d buffers ***//
    int f2 = 1;                                  // guard against division by zero below when m*n <= 128
    if (m*n > 128) { f2 = m*n/128; }
    dim3 dimBlock_A(m/f2+1, n);
    dim3 dimGrid_A(f2+1, 1);

    copy2 <<< dimGrid_A, dimBlock_A >>> (A1_d, A_d, y1_d, y_d, m, n);

    //*** CUDA QR factorization (GEQRF) ***//
    cusolverDnSgeqrf(solver_handle, m, n, A_d, m, TAU_d, work, work_size, devInfo);

    // At this point, the upper triangular part of A_d contains the elements of R.

    //*** CUDA ORMQR execution to find Q (Q_d is assumed to hold the identity on entry) ***//
    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_N, m, n, n, A_d, m, TAU_d, Q_d, m, work, work_size, devInfo);

    // At this point, Q_d contains the elements of Q.

    //*** CUDA ORMQR execution to find Q^T * y ***//
    cusolverDnSormqr(solver_handle, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, m, n, n, A_d, m, TAU_d, y_d, m, work, work_size, devInfo);

    // At this point, y_d contains the elements of Q^T * y, where y is the data vector.

    // Reducing the linear system size
    dim3 Grid(1, 1);
    dim3 Block(n, n);

    update <<< Grid, Block >>> (A_d, R_d, y_d, x_d, m, n);

    // --- Solving the upper triangular linear system R*x = Q^T*y
    const float alpha = 1.f;
    cublasStrsm(cublas_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha, R_d, n, x_d, n);

    // prod: custom kernel that computes the final result res_d from A1_d, y1_d and the solution x_d
    prod <<< 1, m >>> (A1_d, x_d, res_d, y1_d, m, n);

    return;
}

I need to optimize this code, so I used nvprof to find out which parts of it take the most time. I ran this function 7802 times, and in the whole process the only device-to-host transfers I issue are for the final result, so by my count there should be only 7802 device-to-host memcpy calls.
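Simplified, my calling pattern is roughly the following (buffer allocation and initialisation are omitted; the host result buffer res_h and the copy size are only illustrative):

    for (int i = 0; i < 7802; ++i) {
        qr(A_d, y_d, res_d, x_d, R_d, A1_d, y1_d, Q_d, TAU_d,
           devInfo, work, work_size, m, n, i, solver_handle, cublas_handle);
        // the only explicit device-to-host transfer: copying the result of this call back
        cudaMemcpy(res_h, res_d, m * sizeof(float), cudaMemcpyDeviceToHost);
    }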
But nvprof reports 46812 device-to-host transfers, which works out to exactly 6 per call instead of the 1 I expect.
My question is: how is that possible? Is it because of the cuSOLVER and cuBLAS functions I am using in my code?
Here is the output of my nvprof command: https://drive.google.com/file/d/0BwKTZw-Pex8rbVVPWkdWdDIwOG8/view?usp=sharing

Yes, it's certainly possible that the library functions are transferring data from device to host. With a little more effort, by creating a small test code and running it under nvprof, you can confirm whether this is the case yourself.
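For example, a minimal test along the following lines (the matrix size is arbitrary and all error checking is omitted) keeps every buffer on the device and never issues an explicit device-to-host copy; if nvprof still shows "CUDA memcpy DtoH" entries when you profile it, those transfers must be coming from inside the library:

    #include <cuda_runtime.h>
    #include <cusolverDn.h>

    int main() {
        const int m = 64, n = 32;                        // small test size, chosen arbitrarily
        float *A_d = NULL, *TAU_d = NULL, *work = NULL;
        int *devInfo = NULL;
        int work_size = 0;

        cudaMalloc((void**)&A_d, m * n * sizeof(float));
        cudaMalloc((void**)&TAU_d, n * sizeof(float));
        cudaMalloc((void**)&devInfo, sizeof(int));
        cudaMemset(A_d, 0, m * n * sizeof(float));       // matrix contents are irrelevant for profiling

        cusolverDnHandle_t solver_handle;
        cusolverDnCreate(&solver_handle);

        cusolverDnSgeqrf_bufferSize(solver_handle, m, n, A_d, m, &work_size);
        cudaMalloc((void**)&work, work_size * sizeof(float));

        // One factorization; note there is no explicit device-to-host cudaMemcpy anywhere in this program.
        cusolverDnSgeqrf(solver_handle, m, n, A_d, m, TAU_d, work, work_size, devInfo);
        cudaDeviceSynchronize();

        cusolverDnDestroy(solver_handle);
        cudaFree(A_d); cudaFree(TAU_d); cudaFree(work); cudaFree(devInfo);
        return 0;
    }

Build it with something like nvcc -o geqrf_test geqrf_test.cu -lcusolver and run nvprof ./geqrf_test; any "CUDA memcpy DtoH" rows in the resulting summary can then only have been issued by cuSOLVER itself.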

If that is the case, what options do I have to optimise my code? I need to call the above function 7802*1000 times, so nvprof ends up accounting for a huge number of memory transfers.
As you know, cudaMemcpy is expensive, and this heavy use of device-to-host copies is making my code very slow. What options do I have? How can I optimise my code? Is there any way to find the source code of the cuSOLVER and cuBLAS functions?