cuBlas - MemCopy problem

bledos · May 14, 2016, 10:27am

Hello everyone,

I used the cuBLAS function cublasDgetrfBatched() to compute the LU decomposition of a square matrix. Here is the code:

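#include <cuda_runtime.h>
#include <cublas_v2.h>

// LU-factorize one n x n matrix in place with cublasDgetrfBatched().
// Only one matrix is uploaded here, so batchsize is expected to be 1.
void cublas_lu(double *a, int n, int batchsize)
{
    cublasHandle_t handle;
    cublasCreate_v2(&handle);   // the handle-based API does not need cublasInit()
    int *P, *INFO;
    double *a_d;

    cudaMalloc(&P, n * batchsize * sizeof(int));   // pivot indices
    cudaMalloc(&INFO, batchsize * sizeof(int));    // per-matrix status

    cudaMalloc(&a_d, n * n * sizeof(double));
    cudaMemcpy(a_d, a, n * n * sizeof(double), cudaMemcpyHostToDevice);

    // Host-side array of device pointers, one entry per matrix in the batch.
    double *A[] = { a_d };
    double **A_d;

    cudaMalloc<double*>(&A_d, sizeof(A));
    cudaMemcpy(A_d, A, sizeof(A), cudaMemcpyHostToDevice);

    cublasDgetrfBatched(handle, n, A_d, n, P, INFO, batchsize);

    cudaMemcpy( a, a_d, n * n * sizeof(double),cudaMemcpyDeviceToHost);

    // Release device memory and the cuBLAS handle.
    cudaFree(A_d);
    cudaFree(a_d);
    cudaFree(INFO);
    cudaFree(P);
    cublasDestroy_v2(handle);
}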

Everything works fine, except for the final copy from device to host:

cudaMemcpy( a, a_d, n * n * sizeof(double),cudaMemcpyDeviceToHost);

It takes far too long: for a 10000x10000 matrix it takes an unbelievable ~260 sec.
Copying this amount of data should typically take ~1-2 sec. at most.
The interesting thing is that when I run the code without cublasDgetrfBatched(), the copy takes the “normal” 2-3 sec., as expected.

I tried cudaDeviceSynchronize(), cudaMemcpyAsync() and cublasGetMatrix(), but none of them helped :(

Can anybody help?

Robert_Crovella · May 14, 2016, 1:30pm

It’s because the cublasDgetrfBatched routine is taking that long for such a large matrix.
The cudaMemcpy appears to take a long time because it is waiting for the cublas routine to finish (as it should).

The routines cublasDgetrfBatched and cublasDgetriBatched are intended for decomposition/inversion of a batch of small matrices. This is mentioned in the documentation:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-getrfbatched

“This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.”

Use another method for LU decomposition/inversion of a large matrix.
