cusparse concurrency using streams


I’m using the cuSPARSE library to do matrix/vector multiplications. I would like to use streams to launch kernels concurrently, but it doesn’t seem to work: runtimes are about the same with or without streams, and the profiler shows serial kernel execution even when streams are enabled. I’m using a “Tesla C2075” GPU, which supports concurrent kernel execution.

Here is my code:

int numOfMult = 100;
int numStreams = 8;
int N = 100000;             // dimensions of the matrix
int nz = (N-2)*3 + 4;	// # of non-zero values

// use Compressed Sparse Row Format (CSR) to represent matrices
thrust::host_vector<int> hCsrRowIdx(N+1);
thrust::host_vector<int> hCsrColIdx(nz, 0);
thrust::host_vector<double> hCsrVal(nz, 0.0);
thrust::host_vector<double> hX(N, 2.0);
thrust::host_vector<double> hManualAx(N, 0.0);
thrust::host_vector<double> hAx(N*numOfMult, 0.0);
thrust::host_vector<cudaStream_t> hStreamVec(numStreams);
for (int i = 0; i < numStreams; ++i)
    cudaStreamCreate(&hStreamVec[i]);
// generate matrix	
genTridiag(hCsrRowIdx, hCsrColIdx, hCsrVal, N, nz);

// create all the gpu device memory
thrust::device_vector<int> dCsrRowIdx = hCsrRowIdx;
thrust::device_vector<int> dCsrColIdx = hCsrColIdx;
thrust::device_vector<double> dCsrVal = hCsrVal;
thrust::device_vector<double> dX = hX;
thrust::device_vector<double> dAx(N*numOfMult, 0.0);

cusparseHandle_t cusparseHandle = 0;
cusparseStatus_t cusparseStatus = cusparseCreate(&cusparseHandle);
cusparseMatDescr_t descr = 0;
cusparseCreateMatDescr(&descr);
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

double alpha = 1.0, beta = 0.0;
int offset = 0, streamIdx = 0;
for (int i = 0; i < numOfMult; ++i)
{
    offset = i*N;
    // bind the next stream to the handle, then launch the SpMV on it
    cusparseStatus = cusparseSetStream(cusparseHandle, hStreamVec[streamIdx]);
    cusparseStatus = cusparseDcsrmv(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &alpha, descr,
        thrust::raw_pointer_cast(dCsrVal.data()),
        thrust::raw_pointer_cast(dCsrRowIdx.data()),
        thrust::raw_pointer_cast(dCsrColIdx.data()),
        thrust::raw_pointer_cast(dX.data()), &beta,
        thrust::raw_pointer_cast(dAx.data()) + offset);
    ++streamIdx;
    if (streamIdx >= numStreams)
        streamIdx = 0;
}

// wait for all streams to finish
cudaError_t cudaStatus = cudaDeviceSynchronize();

Is the cuSPARSE library blocking concurrent execution of kernels? Is implicit synchronization getting triggered? Any insight into this is much appreciated.

1) Your matrix (100,000 x 100,000) is big enough to fill up the GPU on its own, so you should not expect much overlap between kernels.
2) Using the profiler might also affect kernel overlapping (especially on Fermi).
3) If you have the same matrix and are just changing the vector, you should use csrmm instead of csrmv and compute multiple vectors at once, which should give you a nice speedup.
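For reference, a minimal sketch of the csrmm approach, using the legacy cuSPARSE API from the code above (removed in CUDA 11, where cusparseSpMM replaces it). It reuses the handle, descriptor, and CSR vectors from the original post, and assumes the input vectors are packed column-major into one dense N x numOfMult matrix:

```cuda
// B holds numOfMult input vectors as columns (N x numOfMult, column-major);
// C receives the numOfMult result vectors in the same layout.
thrust::device_vector<double> dB(N * numOfMult, 2.0);
thrust::device_vector<double> dC(N * numOfMult, 0.0);

double alpha = 1.0, beta = 0.0;
// One call multiplies the sparse N x N matrix by all numOfMult vectors,
// which keeps the GPU busy without relying on kernel overlap.
cusparseDcsrmm(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               N,            // rows of A and C
               numOfMult,    // columns of B and C
               N,            // columns of A
               nz, &alpha, descr,
               thrust::raw_pointer_cast(dCsrVal.data()),
               thrust::raw_pointer_cast(dCsrRowIdx.data()),
               thrust::raw_pointer_cast(dCsrColIdx.data()),
               thrust::raw_pointer_cast(dB.data()), N,   // ldb = N
               &beta,
               thrust::raw_pointer_cast(dC.data()), N);  // ldc = N
```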

Thanks for your response philippev.

Re: 1) That’s correct. When I decrease the matrix dimensions to 1000 x 1000 and increase the # of multiplications, GPU utilization degrades significantly and runtimes converge to CPU levels. It seems like concurrent kernel execution is not getting triggered. Any other ideas?
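One quick sanity check (a minimal sketch using the standard CUDA runtime API) is to confirm that the device you are actually running on reports concurrent-kernel support:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0; adjust if you use another
    // concurrentKernels is 1 on devices that can run kernels concurrently
    printf("%s: concurrentKernels = %d\n", prop.name, prop.concurrentKernels);
    return 0;
}
```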

Re: 3) Good point.


Do you find the solution to your problem?

I am also facing the same problem.

My question is: Do we need to associate separate handle to each stream?
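In case it helps anyone hitting the same issue: a common pattern is to create one handle per stream, so each cuSPARSE call is permanently bound to its stream instead of re-pointing a single handle with cusparseSetStream inside the loop. A sketch, reusing the names from the original post (hStreamVec, descr, the CSR device vectors, and alpha/beta are assumed to exist as above):

```cuda
// One cuSPARSE handle per stream; each handle stays bound to its stream.
thrust::host_vector<cusparseHandle_t> handles(numStreams);
for (int i = 0; i < numStreams; ++i)
{
    cusparseCreate(&handles[i]);
    cusparseSetStream(handles[i], hStreamVec[i]);
}

for (int i = 0; i < numOfMult; ++i)
{
    int s = i % numStreams;
    // Issue the SpMV through the handle already attached to stream s.
    cusparseDcsrmv(handles[s], CUSPARSE_OPERATION_NON_TRANSPOSE,
                   N, N, nz, &alpha, descr,
                   thrust::raw_pointer_cast(dCsrVal.data()),
                   thrust::raw_pointer_cast(dCsrRowIdx.data()),
                   thrust::raw_pointer_cast(dCsrColIdx.data()),
                   thrust::raw_pointer_cast(dX.data()), &beta,
                   thrust::raw_pointer_cast(dAx.data()) + i*N);
}
cudaDeviceSynchronize();

for (int i = 0; i < numStreams; ++i)
    cusparseDestroy(handles[i]);
```

Note that even with one stream per call, small SpMV kernels may still serialize if each launch saturates the device, as pointed out earlier in the thread.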