nvidia-smi does not show GPU use while using cuSPARSE library methods for CSR SpMV operation

I am trying to perform sparse matrix-vector multiplication (SpMV) with the CSR format. I allocate space on the device for the sparse matrix, the dense input vector x, and the output vector y, and then perform the SpMV with the cusparseDcsrmv cuSPARSE API. I repeat the SpMV 100 times, but nvidia-smi (even with ‘watch nvidia-smi’) does not show any GPU use. I am using a GPU server that has 4 NVIDIA GTX 1000 Ti devices.
Instead, ‘top’ shows 100% CPU use. I am using a matrix from the SuiteSparse Matrix Collection. My question: do the cuSPARSE library functions (cusparseDcsrmv(.,.) etc.) use only the GPU for the SpMV computation, or may they also use the CPU? Here is the code I am using. Please help me if I am doing something wrong.

xHostPtr = (double*)malloc(n * sizeof(xHostPtr[0])); // allocate host vector x
yHostPtr = (double*)malloc(m * sizeof(yHostPtr[0])); // allocate host vector y
if ((!xHostPtr) || (!yHostPtr)) {
    CLEANUP("Host malloc failed (vectors)");
    return -1;
}
for (int i = 0; i < m; i++) {
    yHostPtr[i] = 0.0; // output vector starts at zero
}
for (int i = 0; i < n; i++) {
    xHostPtr[i] = 1.0; // input vector of all ones
}
/* allocate GPU memory and copy the matrix and vectors x, y into it */
cudaStat1 = cudaMalloc((void**)&csrRowPtr, (m + 1) * sizeof(csrRowPtr[0]));
cudaStat2 = cudaMalloc((void**)&ColIndex, nnz * sizeof(ColIndex[0]));
cudaStat3 = cudaMalloc((void**)&Values, nnz * sizeof(Values[0]));
cudaStat4 = cudaMalloc((void**)&y, m * sizeof(y[0]));
cudaStat5 = cudaMalloc((void**)&x, n * sizeof(x[0]));
if ((cudaStat1 != cudaSuccess) || (cudaStat2 != cudaSuccess) || (cudaStat3 != cudaSuccess) ||
    (cudaStat4 != cudaSuccess) || (cudaStat5 != cudaSuccess)) {
    CLEANUP("Device malloc failed");
    return -1;
}
cudaStat1 = cudaMemcpy(csrRowPtr, csrRowHostPtr, (size_t)((m + 1) * sizeof(csrRowPtr[0])),
                       cudaMemcpyHostToDevice);
cudaStat2 = cudaMemcpy(ColIndex, ColIndexHostPtr, (size_t)(nnz * sizeof(ColIndex[0])),
                       cudaMemcpyHostToDevice);
cudaStat3 =
    cudaMemcpy(Values, ValuesHostPtr, (size_t)(nnz * sizeof(Values[0])), cudaMemcpyHostToDevice);
cudaStat4 = cudaMemcpy(y, yHostPtr, (size_t)(m * sizeof(y[0])), cudaMemcpyHostToDevice);
cudaStat5 = cudaMemcpy(x, xHostPtr, (size_t)(n * sizeof(x[0])), cudaMemcpyHostToDevice);
if ((cudaStat1 != cudaSuccess) || (cudaStat2 != cudaSuccess) || (cudaStat3 != cudaSuccess) ||
    (cudaStat4 != cudaSuccess) || (cudaStat5 != cudaSuccess)) {
    CLEANUP("Memcpy from Host to Device failed");
    return -1;
}

status = cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz, &alpha, descr,
Values, csrRowPtr, ColIndex, &x[0], &beta, &y[0]);

Please help me. I tried this with all formats supported by cuSPARSE, but none of the API calls shows GPU use when I run nvidia-smi.
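
For reference, the operation requested from cusparseDcsrmv in the non-transpose case is y = alpha * A * x + beta * y with A stored in CSR form. The following is a minimal CPU sketch of that computation, useful for checking a GPU result on a small matrix. The function name csr_spmv_ref is hypothetical (it is not part of cuSPARSE), and zero-based indexing is assumed; the post does not show how the matrix descriptor was configured.

```c
#include <assert.h>

/* Hypothetical CPU reference, not a cuSPARSE function.
 * Computes y = alpha * A * x + beta * y for an m-row matrix A in CSR form.
 * Assumes zero-based row_ptr (length m + 1) and col_idx (length nnz). */
void csr_spmv_ref(int m, double alpha,
                  const double *values, const int *row_ptr, const int *col_idx,
                  const double *x, double beta, double *y)
{
    assert(m >= 0);
    for (int row = 0; row < m; row++) {
        double sum = 0.0;
        /* Non-zeros of this row occupy values[row_ptr[row] .. row_ptr[row + 1] - 1]. */
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++) {
            sum += values[k] * x[col_idx[k]];
        }
        y[row] = alpha * sum + beta * y[row];
    }
}
```

With x filled with 1.0, alpha = 1 and beta = 0, as in the setup above, each y[i] should equal the sum of the non-zeros in row i, which gives a quick sanity check on the device result after copying y back to the host.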

I’m not able to compile your code. However cusparseDcsrmv definitely uses the GPU, and according to my testing with a large enough problem you can witness GPU activity using nvidia-smi.

If in doubt, just run a profiler. If the profiler shows you a csrmv kernel that is running for a few microseconds or a few milliseconds, it’s no surprise you don’t see it on nvidia-smi, which samples at around a 1 second interval:

26.74%  3.8080us         1  3.8080us  3.8080us  3.8080us  void csrMv_kernel<double, double, double, int=128, int=2>(cusparseCsrMvParams<double, double, double>)
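
To put that number in perspective, here is a rough back-of-the-envelope sketch in C. It estimates what fraction of one nvidia-smi sampling window the GPU spends inside the kernel. The helper name busy_fraction is hypothetical, and the estimate assumes the kernel time from the profiler line above, ignores launch overhead and memory copies, and treats the sampling window as exactly 1 second.

```c
#include <assert.h>

/* Hypothetical helper: fraction (0..1) of one sampling window during which
 * a kernel is running, assuming `launches` kernels of `kernel_time_us`
 * microseconds each and nothing else executing on the GPU. */
double busy_fraction(double kernel_time_us, int launches, double sample_interval_s)
{
    assert(sample_interval_s > 0.0);
    double busy_s = kernel_time_us * 1e-6 * (double)launches;
    double fraction = busy_s / sample_interval_s;
    return fraction > 1.0 ? 1.0 : fraction; /* cannot exceed the whole window */
}
```

For example, busy_fraction(3.808, 100, 1.0) is about 0.00038, i.e. roughly 0.04% of the window for 100 repetitions of a 3.8 microsecond kernel, which nvidia-smi would round to 0%. A larger matrix or many more repetitions would be needed before the utilization figure moves visibly.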