cusparse concurrency using streams

dennisNYC · January 16, 2013, 3:51pm

Hello,

I’m using cusparse library to do matrix/vector multiplications. I would like to make use of streams to launch kernels concurrently, but it doesn’t seem to work. Runtimes are about the same with or without streams. The profiler is showing serial kernel execution when streams are enabled. I’m using “Tesla C2075” GPU device which supports concurrent kernel execution.

Here is my code:

int numOfMult = 100;
int numStreams = 8;
int N = 100000;             // dimensions of the matrix
int nz = (N-2)*3 + 4;	// # of non-zero values

// use Compressed Sparse Row Format (CSR) to represent matrices
thrust::host_vector hCsrRowIdx(N+1);
thrust::host_vector hCsrColIdx(nz, 0);
thrust::host_vector hCsrVal(nz, 0.0);
thrust::host_vector hX(N, 2.0);
thrust::host_vector hManualAx(N, 0.0);
thrust::host_vector hAx(N*numOfMult, 0.0);
thrust::host_vector hStreamVec(numStreams);
for (int i = 0; i < numStreams; ++i)
    cudaStreamCreate(&hStreamVec[i]);

// generate matrix	
genTridiag(hCsrRowIdx, hCsrColIdx, hCsrVal, N, nz);

// create all the gpu device memory
thrust::device_vector dCsrRowIdx = hCsrRowIdx;
thrust::device_vector dCsrColIdx = hCsrColIdx;
thrust::device_vector dCsrVal = hCsrVal;
thrust::device_vector dX = hX;
thrust::device_vector dAx(N*numOfMult, 0.0);

cusparseHandle_t cusparseHandle = 0;
cusparseStatus_t cusparseStatus = cusparseCreate(&cusparseHandle);
cusparseMatDescr_t descr = 0;
cusparseCreateMatDescr(&descr); 
cusparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ZERO);
	
int offset = 0, streamIdx = 0;
for(int i = 0; i < numOfMult; ++i)
{
    offset = i*N;
    cusparseStatus = cusparseSetStream(cusparseHandle, hStreamVec[streamIdx]);
    cusparseStatus = cusparseDcsrmv(cusparseHandle, 
				CUSPARSE_OPERATION_NON_TRANSPOSE, 
				N, 
			         N, 
				nz, 
				&alpha, 
				descr,
				thrust::raw_pointer_cast(&dCsrVal[0]),
				thrust::raw_pointer_cast(&dCsrRowIdx[0]),
				thrust::raw_pointer_cast(&dCsrColIdx[0]),
				thrust::raw_pointer_cast(&dX[0]),
				&beta,
				thrust::raw_pointer_cast(&dAx[offset]));
	++streamIdx;
	if (streamIdx >= numStreams)
	   streamIdx = 0;
}

// wait for all streams to finish
cudaError_t cudaStatus = cudaDeviceSynchronize();

Is cusparse library blocking concurrent execution of kernels? Is Implicit Synchronization getting triggered? Any insight into this is much appreciated.

philippev · January 18, 2013, 7:24pm

1)Your matrix (100.000 x 100.000) is big enough to fill up the GPU so you should not get much overlap of the kernels.
2)Using the profiler might affect also the kernel overlapping (especially on Fermi)

If you have the same matrix and just change the vector, you should use csrmm instead of csrmv, and compute multiple vectors at once which should give you a nice speedup.

dennisNYC · January 18, 2013, 10:04pm

Thanks for your response philippev.

Re: 1) That’s correct. When I decrease matrix dimensions to 1000 x 1000 and increase # of multiplications than GPU utilization degrades significantly and runtimes converge to CPU levels. It seems like concurrent kernel execution is not getting triggered. Any other ideas?

Re: 3) Good point.

gaurav_singhal · July 19, 2013, 1:24pm

@dennisNYC

Do you find the solution to your problem?

I am also facing the some problem.

My question is: Do we need to associate separate handle to each stream?

Topic		Replies	Views
Concurrent Kernels with CUSPARSE Library CUDA Programming and Performance	4	819	October 22, 2015
Hundreds of parallel matrix-vector multiplications with cuBLAS GPU-Accelerated Libraries	8	2245	April 8, 2021
My streams are not running concurrently CUDA Programming and Performance	7	1741	March 6, 2018
Streaming cuSolver GPU-Accelerated Libraries	2	1610	June 9, 2015
Concurrent executions of streams CUDA Programming and Performance	6	420	December 19, 2022
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2397	April 29, 2019
Streams and multiprocessor usage? CUDA Programming and Performance	3	2893	September 20, 2008
How to effectively parallelize cuda kernel launches on CPU CUDA Programming and Performance	9	3006	January 19, 2018
Why streams cant run concurrently CUDA Programming and Performance	4	917	March 22, 2018
CUDA stream performance CUDA Programming and Performance	5	2374	July 23, 2013

cusparse concurrency using streams

Related topics