Upgrading to CUDA 12.4 broke down the application

Hi,
I recently upgraded from CUDA 12.3 to CUDA 12.4. I use cuSparse ad cuBlas in my application for doing algebras. And it is on Windows 11. After upgrade and rebuild the application with CUDA 12.4 toolkit, it throws error such as illegal memory access on the device.

I’ve tried using compute-sanitizer to debug. It seems that this error happens within the cusparse_SpMV routine (SpMV_buffersize doesn’t throw any error). I’ve checked the matrix and vectors and confirmed that they were correct allocated on device before entering SpMV. So I really couldn’t find out what could be wrong since everything worked well with CUDA toolkit 12.3.

I found the following known issue from the release note: cusparseSpMV introduces invalid memory accesses when the output vector is not aligned to 16 bytes. Would this be the potential cause? If yes, how could I fix it? Thanks.

Regards,
Lu

Perhaps.

If you are using compute-sanitizer, it will provide additional information about the invalid memory access error, and if this is the reason (16 byte alignment) I would expect the compute-sanitizer output to also include a statement like “address xxxxxx is misaligned”.

Alignment in CUDA is discussed here. The simplest way I can describe to fix it with no code provided is to say that be sure that any pointer you provide to cusparse for output storage has the lowest 4 bits equal to zero. For 16 byte alignment, this is equivalent to the statement given in the programming guide: “the first address must be a multiple of the size”.

Hi @Robert_Crovella , thanks for your kind reply.

Since the actual code is a bunch of abstraction and templates which is difficult to share, I’ve made a simple snippet that resembles what I have in my code where the error occurs. I’ve omitted unnecessary “checkCudaError” codes for clarity. Please have a look.

using Scalar = double;
using ScalarPtr = Scalar*;
using Index = size_t;
using IntPtr = int*;						 

void device_assign(ScalarPtr dResult, Scalar val) noexcept
{
  cudaMemcpy(dResult, &val, sizeof(Scalar), cudaMemcpyHostToDevice);
}

struct CudaContext
{
  cublasHandle_t cublasHandle{nullptr};
  cusparseHandle_t cusparseHandle{nullptr};
};

struct SparseMatrixShape
{
  Index nRows{ 0 }; /*!< Number of rows*/
  Index nColumns{ 0 }; /*!< Number of columns*/
  Index nNonZeros{ 0 }; /*!< Number of non-zero entries*/
};

class SparseMatrixCsr
{
public:
  void toDeviceSparseMatrix()
  {
    // Allocate device variables
    IntPtr device_csrRowOffsets{nullptr};
    IntPtr device_csrColumnIndex{nullptr};
    ScalarPtr device_csrValues{nullptr};
	
    cudaMalloc(reinterpret_cast<VoidPtr *>(&device_csrRowOffsets), sizeof(int) * (shape_.nRows + 1));
    cudaMalloc(reinterpret_cast<VoidPtr *>(&device_csrColumnIndex), sizeof(int) * shape_.nNonZeros);
    cudaMalloc(reinterpret_cast<VoidPtr *>(&device_csrValues), sizeof(Scalar) * shape_.nNonZeros);
  
    // Copy from host to device
    cudaMemcpy(device_csrRowOffsets, csrRowOffsets_.data(), sizeof(int) * (shape_.nRows + 1), cudaMemcpyHostToDevice);
    cudaMemcpy(device_csrColumnIndex, csrColumnIndex_.data(), sizeof(int) * shape_.nNonZeros, cudaMemcpyHostToDevice);
    cudaMemcpy(device_csrValues, csrValues_.data(), sizeof(Scalar) * shape_.nNonZeros, cudaMemcpyHostToDevice);
    cusparseIndexBase_t cudaIndexBase = indexBase_ ? CUSPARSE_INDEX_BASE_ONE : CUSPARSE_INDEX_BASE_ZERO;
  
    // Build sparse matrix by cuSPARSE
    cusparseStatus_t cudaStatus = cusparseCreateCsr(
	  &deviceSparseMatrix_,
	  shape_.nRows,
	  shape_.nColumns,
	  shape_.nNonZeros,
	  device_csrRowOffsets,
	  device_csrColumnIndex,
	  device_csrValues,
	  CUSPARSE_INDEX_32I,
      CUSPARSE_INDEX_32I,
      cudaIndexBase,
      CUDA_R_64F);
  }
  
  auto getDeviceSparseMatrix() const noexcept { return deviceSparseMatrix_; }

  // other methods...

private:
  SparseMatrixShape shape_;
  cusparseSpMatDescr_t deviceSparseMatrix_{ nullptr }; /*!< cuSparse spMatrix descriptor on device*/
  int indexBase_{ 1 }; /*!< Sparse matrix index base*/
  std::span<int> csrRowOffsets_{}; /*!< Row offsets array, size = nRows_ + 1*/
  std::span<int> csrColumnIndex_{}; /*!< Column index array, size = nNonZeros_*/
  std::span<Scalar> csrValues_{}; /*!< Value array, size = nNonZeros_*/
  
  // other members...
}

void run_SpMV()	
{
  // Initialize cusparse context etc.
  CudaContext context;
  cudaStream_t stream{nullptr}; 
  
  cudahelper::createCuBLASContext(context.cublasHandle);
  cudahelper::createCuSPARSEContext(context.cusparseHandle);
  cudaStreamCreate(&stream);
  cublasStatus_t cublasStatus = cublasSetStream(context.cublasHandle, stream);
  cusparseStatus_t cusparseStatus = cusparseSetStream(context.cusparseHandle, stream);

  cublasSetPointerMode(context.cublasHandle, CUBLAS_POINTER_MODE_DEVICE);
  cusparseSetPointerMode(context.cusparseHandle, CUSPARSE_POINTER_MODE_DEVICE);
  
  // Allocate constants on device
  ScalarPtr device_constants{nullptr};
  ScalarPtr device_one{nullptr};
  ScalarPtr device_minusOne{nullptr};
  ScalarPtr device_zero{nullptr};
  
  cudaMalloc(&device_constants, 3 * sizeof(Scalar));
  device_one = device_constants++;
  device_zero = device_constants++;
  device_minusOne = device_constants++;
  device_constants = device_one;

  device_assign(device_one, 1.0);
  device_assign(device_zero, 0.0);
  device_assign(device_minusOne, -1.0);
  
  // Initialize matrixA ...
  SparseMatrixCsr matrixA;
  /*
    ... read A from files etc.
  */
  matrixA.toDeviceSparseMatrix();
  
  // Create host vector x and vector y
  std::vector<double> host_x;
  std::vector<double> host_Y;
  Index size{matrixA.nColumns};
  host_y.assign(size, 0.0);
  /*
    ... assign values to host_x from files etc.
  */
  
  // Create device vector device_x and vector device_y
  cusparseDnVecDescr_t device_x{nullptr}; 
  cusparseDnVecDescr_t device_y{nullptr}; 
  
  // Allocate device variables
  ScalarPtr device_dense_x{nullptr};
  ScalarPtr device_dense_y{nullptr};
  
  cudaMalloc(reinterpret_cast<VoidPtr *>(&device_dense_x), sizeof(Scalar) * size);
  cudaMalloc(reinterpret_cast<VoidPtr *>(&device_dense_y), sizeof(Scalar) * size);
  
  // Copy from host to device
  cudaMemcpy(device_dense_x, host_x.data(), sizeof(Scalar) * size, cudaMemcpyHostToDevice);
  cudaMemcpy(device_dense_y, host_y.data(), sizeof(Scalar) * size, cudaMemcpyHostToDevice);
  
  // Build dense vector by cuSPARSE
  cusparseCreateDnVec(&device_x, size, device_dense_x, CUDA_R_64F);
  cusparseCreateDnVec(&device_y, size, device_dense_x, CUDA_R_64F);

  // Make buffer for SpMV
  VoidPtr device_buffer{nullptr};
  size_t bufferSizeMV{0};
  cusparseStatus_t cusparseStatus = cusparseSpMV_bufferSize(
      cusparseHandle,
      CUSPARSE_OPERATION_NON_TRANSPOSE,
      device_one,
      matrixA.getDeviceSparseMatrix(),
      device_x,
      device_zero,
      device_y,
      CUDA_R_64F,
      CUSPARSE_SPMV_ALG_DEFAULT,
      &bufferSizeMV);
  cudaMalloc(&device_buffer, bufferSizeMV);
  
  // Excecute SpMV
  auto cusparseStatus = cusparseSpMV(
    cusparseHandle,
    CUSPARSE_OPERATION_NON_TRANSPOSE,
    device_one,
    matrixA.getDeviceSparseMatrix(),
    device_x,
    device_zero,
    device_y,
    CUDA_R_64F,
    CUSPARSE_SPMV_ALG_DEFAULT,
    device_buffer);

  // Free resources 
  // ..
}

The error I got from compute-sanitizer is as follows. It doesn’t show “address xxxxxx is misaligned”. I doubt that the part of code that allocates the constants on the device via pointer arithmetic is maybe wrong. What do you think? And the array allocated by cudaMalloc is guaranteed to be aligned, right? Thanks.


========= Invalid __global__ read of size 4 bytes
=========     at void cusparse::csrmv_v3_kernel<std::integral_constant<bool, (bool)0>, int, int, double, double, double, double, void>
                         (cusparse::KernelCoeffs<T7>, const T3 *, const T2 *, const T3 *, const T4 *, T2, int, int, const T5 *, T6 *, T3 *, T7 *)+0x21b0
=========     by thread (0,0,0) in block (64,0,0)
=========     Address 0xc083678e8 is out of bounds
=========     and is 185 bytes after the nearest allocation at 0xc08367800 of size 48 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffcbc5edc34]
=========                in C:\WINDOWS\system32\DriverStore\FileRepository\nvblwi.inf_amd64_77518d2b617eecb2\nvcuda64.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc83cbe003]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc83cbdec6]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc83cbc237]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc8356accf]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc835563db]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc83573729]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc83577d51]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpMV [0x7ffc834b3d46]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cusparse64_12.dll
=========     Host Frame:cudasolver::utility::cuSparseSpmv<cudasolver::SparseMatrixCsr<std::vector<int,std::allocator<int> >,
                         std::vector<double,std::allocator<double> > >,cusparseDnVecDescr *,cusparseDnVecDescr *,double *> in \some_path\include\utility\cuutility.cuh:347 [0x84226]

Yes.

an out of bounds error is pretty much what it sounds like. It’s not a misalignment error.

I can’t compile your code and so wouldn’t be able to help any further. I suggest debugging. Find that exact line in your code that triggered this error, and study all the arguments to it.

Alright, thanks for the response. But what still confuses me is the the same code works with CUDA 12.3 but not with CUDA 12.4. This is so weird. Shouldn’t it be backward compatible?

bugs in CUDA are always possible. I can’t ascertain anything without code that I can test. For me, debugging is still the first step, rather than assuming that it is a bug in the compiler or toolchain or library.

Do as you wish, of course.

Thanks for the hints. I totally agree that debugging is the first thing that needs to be done. I’m actually trying my best to debug. As you know, it sometimes can be a real headache especially with a closed-source library such as cusparse. That’s why, at the same time, I just want to ask if there is other possible reasons that might lead to this bug, from your side.

Hi Robert,

It’s been a while but I still couldn’t figure out the cause of that error. However, I found out that the final function that threw the error is cusparseSpruneCsr2csr_bufferSizeExt which works on a single precision CSR matrix. But actually, the cusparseSpMV was called with a double precision matrix. So, shouldn’t cusparseDpruneCsr2csr_bufferSizeExt be called under the hood instead? Does it make sense to you? Thanks in advance.

Best regards,
Lu

Hi @417luke318. Can you give us more details of the issue:

  • Are you using legacy APIs or generic APIs? I see you’re using SpMV (which is a generic API) and cusparseSpruneCsr2csr_bufferSizeExt (a legacy API.
  • Can you run again with environment variable CUSPARS_LOG_LEVEL=5 and send us the logs?

Hi,

  1. I just called SpMV which is the generic API. The legacy function cusparseSpruneCsr2cse_bufferSizeExt, I think, was called within the SpMV kernel function, as shown in the compute-sanitizer log. I never called that function explicitly.

  2. Below is the last log before the error occured with CUSPARSE_LOG_LEVEL=5 on.

[2024-06-25 23:19:41][CUSPARSE][58008][Api][cusparseSpMV] handle[in]={ptr=0x1db28036a00, mode=DEVICE}, opA[in]=NON_TRANSPOSE, alpha[in]={value=1, ptr=0xc06f66e48}, matA[in]={format=CSR, ptr=0x1db2820e2b0, rows=334246, cols=334246, nnz=26022572, offsets=0xc06e20600, offsetType=32I, columnInd=0xc0ce00000, columnType=32I, values=0xc13200000, valuesType=CUDA_R_64F, baseIdx=ONE}, vecX[in]={ptr=0x1db2815b4f0, size=334246, values=0xc02400000, valuesType=CUDA_R_64F}, beta[in]={value=0, ptr=0xc06f66e50}, vecY[inout]={ptr=0x1db669d8c00, size=334246, values=0xc04200000, valuesType=CUDA_R_64F}, computeType[in]=CUDA_R_64F, alg[in]=SPMV_CSR_ALG1, externalBuffer[tmp]=0xc06f6ae00
[2024-06-25 23:19:41][CUSPARSE][58008][Trace][cusparseSpMV] vector_scalar_multiply_kernel<:type_n><<<653, 256, 0, 0x1db28188f30>>>()
[2024-06-25 23:19:41][CUSPARSE][58008][Trace][cusparseSpMV] csrmv_v3_kernel<:type_n><<<101651, 32, 0, 0x1db28188f30>>>()

Below is the log I got from compute-sanitizer (note that I now use CUDA 12.5 but the same error occurred in the same function cusparseSpMV). It is still an out-of-bound error. If you look at the backtrace, it basically flowed as cusparseSpMVcusparseZgemvi_bufferSizecusparseSpruneCsr2csr_bufferSizeExt. I only called cusparseSpMV with double-precision matrix. Any thought on it? I guess the answer is in cusparse::csrmv_v3_kernel. Thanks.

========= Invalid __global__ atomic of size 8 bytes
=========     at void cusparse::csrmv_v3_kernel<std::integral_constant<bool, (bool)0>, int, int, double, double, double, double, void>(cusparse::KernelCoeffs<T7>, const T3 *, const T2 *, const T3 *, const T4 *, T2, int, int, const T5 *, T6 *, T3 *, T7 *)+0x4da0
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x9234f6c68 is out of bounds
=========     and is 12.297.212.824 bytes before the nearest allocation at 0xc00480000 of size 65.536 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffbf26ad3a4]
=========                in C:\WINDOWS\system32\DriverStore\FileRepository\nvblwi.inf_amd64_f108b93fce536d1c\nvcuda64.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc15f7ccd3]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc15f7cb96]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpruneCsr2csr_bufferSizeExt [0x7ffc15f7af07]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc15825cff]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc1580dd33]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc1582eca9]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseZgemvi_bufferSize [0x7ffc15834b61]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:cusparseSpMV [0x7ffc15763d96]
=========                in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\cusparse64_12.dll
=========     Host Frame:xxx::cudasolver::utility::matrixVectorProductGPU<cusparseContext *,cusparseSpMatDescr *,cusparseDnVecDescr *> in D:\bb2\src\xxx\sandbox\plags\include\utility\utility.h:124 [0x8b032]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:xxx::cudasolver::detail::cuCGMainLoop<xxx::cudasolver::SparseMatrixCsr<std::span<int,-1>,std::span<double,-1> >,xxx::cudasolver::preconditioner::InvertBlockJacobiPreconditioner<xxx::cudasolver::SparseMatrixCsr<std::vector<int,std::allocator<int> >,std::vector<double,std::allocator<double> > >,30,0> > in D:\bb2\src\xxx\sandbox\plags\include\detail\cg_functions.h:99 [0x7ee47]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:xxx::cudasolver::algorithm::PreconditionedCUAlgorithm<xxx::cudasolver::preconditioner::InvertBlockJacobiPreconditioner<xxx::cudasolver::SparseMatrixCsr<std::vector<int,std::allocator<int> >,std::vector<double,std::allocator<double> > >,30,0>,xxx::cudasolver::algorithm::CGCUAlgorithm<10000,1024,32> >::run<xxx::cudasolver::SparseMatrixCsr<std::span<int,-1>,std::span<double,-1> >,xxx::cudasolver::DenseVector<std::span<double,-1> >,xxx::cudasolver::DenseVector<std::span<double,-1> > > in D:\bb2\src\xxx\sandbox\plags\include\preconditioned_cualgorithm.h:49 [0x7b820]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:xxx::cudasolver::tests::runMultipleIterations<xxx::cudasolver::CUsolver<xxx::cudasolver::algorithm::PreconditionedCUAlgorithm<xxx::cudasolver::preconditioner::InvertBlockJacobiPreconditioner<xxx::cudasolver::SparseMatrixCsr<std::vector<int,std::allocator<int> >,std::vector<double,std::allocator<double> > >,30,0>,xxx::cudasolver::algorithm::CGCUAlgorithm<10000,1024,32> > >,2> in D:\bb2\src\xxx\sandbox\plags\test\testUtils.h:146 [0x7b5f0]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:xxx::cudasolver::tests::____C_A_T_C_H____T_E_S_T____0 in D:\bb2\src\xxx\sandbox\plags\test\cgE2ETest.cpp:52 [0x78faf]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:Catch::RunContext::runCurrentTest in D:\bb2\src\xxx\extlib\catch2\catch.hpp:10910 [0x44fc2]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:Catch::RunContext::runTest in D:\bb2\src\xxx\extlib\catch2\catch.hpp:10681 [0x43575]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:Catch::`anonymous namespace'::runTests in D:\bb2\src\xxx\extlib\catch2\catch.hpp:11242 [0x46bb0]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:Catch::Session::runInternal in D:\bb2\src\xxx\extlib\catch2\catch.hpp:11444 [0x48401]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:main in D:\bb2\src\xxx\sandbox\plags\test\main.cpp:19 [0x5a92d]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:__scrt_common_main_seh in D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288 [0xafa48]
=========                in D:\bb2\src\xxx\sandbox\plags\lib\x64\Release\test.exe
=========     Host Frame:BaseThreadInitThunk [0x7ffd41bb257d]
=========                in C:\WINDOWS\System32\KERNEL32.DLL
=========     Host Frame:RtlUserThreadStart [0x7ffd42d2af28]
=========                in C:\WINDOWS\SYSTEM32\ntdll.dll

Hi @417luke318. Sorry for the late reply. Is it possible for you to send us the matrix so that we can reproduce the issue?
Also, is your vector aligned, i.e. the address is multiple of 16 bytes?

Hi @qanhpham . Thanks for the reply. I attached the matrix file together with the right-hand-side b vector and solution x vector(you may not need it) of the linear system that I try to solve with an iterative solver. It is a Ax= b solution. Basically, what I did is to solve this linear system with a conjugate gradient solver. The cusparseSpMv is called, as you probably know, during each solving iteration.

So I first solve this system with the CG solver, it works well. All cusparseSpMv and other cuda functions are successful. But when I try to solve it the second time (before which I reset the x = 0), the error occurred when calculate the initial residual r0 = b - Ax with x = 0. So y = Ax failed with the log that I posted previously.

I’ve printed out all the inputs(matrixA, in+out vectors, scalars(zero and one)) of cusparseSpMv, they are all correct. It is also hard to find anything strange from the log from setting CUSPARSE_LOG_LEVEL=5. I’m not sure if you could reproduce the bug without my code. But since it is a company project, unfortunately I don’t have the right to share the whole code. It works well with CUDA 12.3 (and previous versions). So I guess it is related to the change of cusparseSpMv from CUDA 12.3. BTW, I use CUSPARSE_SPMV_ALG_DEFAULT with 32I index type and CUDA_R_64F value type.

As for the alignment, I allocate all the vectors on the device with cudaMalloc(...) so should be correct. And I also checked the memory address of the input and output vectors. They are indeed aligned with 16 bytes.

If there is any other way that could help to find out the bug, please let me know. Thanks.

These are the matrix files: matrix files.

Hi @417luke318,
Thanks for the data and explanation. I’m investigating your use case.

But when I try to solve it the second time (before which I reset the x = 0 ), the error occurred

When you ran the 2nd SpMV, did you reuse the buffer used by the 1st SpMV or you allocated a new buffer? Can you try using the same buffer for all SpMV on the same matrix?