cusparseScsrmv transpose mode is not working

Hi,

I am trying to use cusparseScsrmv for matrix-vector multiplication.

The CUSPARSE_OPERATION_NON_TRANSPOSE mode works fine. However, with CUSPARSE_OPERATION_TRANSPOSE mode, although cusparseScsrmv returns a success status, my following cublas call, cublasSnrm2, fails with the error message “code=13(CUBLAS_STATUS_EXECUTION_FAILED) “cublasStatus””

If I transpose the matrix myself and call cusparseScsrmv with CUSPARSE_OPERATION_NON_TRANSPOSE mode, it works fine.

Very strange behavior. Has anyone had a similar experience?

Here is the part of the code that has the problem.

    cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
        N, M, nnz, &alpha, descr, d_csc_val, d_csc_col, d_csc_row, d_u, &beta, d_v);

    //cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
    //    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

    cudaDeviceSynchronize();
    checkCudaErrors(cusparseStatus);

    cublasStatus = cublasSnrm2(cublasHandle, N, d_v, 1, &a);
    cudaDeviceSynchronize();
    //printf("0 a = %12.12f \n", a);
    checkCudaErrors(cublasStatus);

Maybe you simply have a formatting error in your transposed matrix.

Such errors may not be discovered until the cusparse algorithm actually runs. The cusparse function call may involve asynchronous activity, meaning that the function can return a status code (success, in your case) before the algorithm has completed the matrix-vector multiplication. You can also sometimes get a sense of this by running your code under cuda-memcheck. If an illegal access error is reported, it is an indication that you may be feeding incorrect data into the cusparse call.

It’s impossible to be sure about that without a complete test case.
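The checking pattern I'm describing can be sketched like this (a fragment reusing the variable names from your snippet; checking the cusparse status alone is not enough, because a kernel launched by the library can fail after the call has already returned):

```cuda
// Launch the SpMV; "success" here only means the work was accepted.
cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);
checkCudaErrors(cusparseStatus);

// Force completion and ask the runtime whether any kernel actually failed.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "asynchronous kernel failure after csrmv: %s\n",
            cudaGetErrorString(err));
}
```

Once such an error occurs, the CUDA context is corrupted, so later unrelated calls (cublas, cudaFree) will also report failures.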

The transposed-matrix call has no problem if I transpose the matrix explicitly myself:
cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
N, M, nnz, &alpha, descr, d_csc_val, d_csc_col, d_csc_row, d_u, &beta, d_v);

The call with CUSPARSE_OPERATION_TRANSPOSE on the original matrix is the one that has the problem:
//cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
// M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

The strange part is that cusparseStatus returns success even after I call cudaDeviceSynchronize(), but the code then fails at the cublas function.

Let me try cuda-memcheck.

Here is the error message; it looks like it hit a cudaFree problem:

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (110,0,0) in block (130,0,0)
========= Address 0xfff70650 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (109,0,0) in block (130,0,0)
========= Address 0xfff70094 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (108,0,0) in block (130,0,0)
========= Address 0xfff70090 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x00000a68 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (126,0,0) in block (75,0,0)
========= Address 0xfff70650 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x381a06]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17c5e1]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:./spmv [0x38206]
========= Host Frame:./spmv [0x3e9b]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaFuncGetAttributes.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x3b9d79]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x74788]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x77e7f]
========= Host Frame:./spmv [0x3ed0]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

CUDA error at spmv.cu:316 code=13(CUBLAS_STATUS_EXECUTION_FAILED) “cublasStatus”
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:./spmv [0x38206]
========= Host Frame:./spmv [0x3ed7]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= ERROR SUMMARY: 8 errors

One of the kernels in csrmv is failing. The cudaFree error is simply reflecting that some previous kernel execution error occurred. The problem is not due to cudaFree itself. The subsequent CUBLAS error is not really arising due to any CUBLAS activity, but is reflecting the fact that a previous kernel execution error occurred. The situation is approximately as I described. It may be due to misformatted input. It might be a bug in cusparse. I doubt further conclusions could be reached without a complete test case.

I tried with a small matrix and it works fine; cuda-memcheck doesn’t report any errors. But if I test with a bigger matrix (4 GB), it fails with those errors.

Is there anywhere I can upload the test case?

If the problem is as you describe, you should be able to demonstrate it in less than 100 lines of code. In that case, you can paste your code in this thread. Condense your code down to only what is necessary to demonstrate the problem. Be sure to use the code formatting option in the toolbar above your edit area.

Alternatively if you simply wish to report a bug, you can use the bug reporting portal at developer.nvidia.com

Here is the code. I think the input data size matters.

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <iostream>  // needed for the std::cout usage message below

/* Using updated (v2) interfaces to cublas */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

// Utilities and system includes
#include <helper_functions.h>  // helper for shared functions common to CUDA Samples
#include <helper_cuda.h>       // helper function CUDA error checking and initialization


int main(int argc, char **argv)
{

    if (argc<3)
    {
        std::cout << "mat_file, rhs_file" << std::endl;
        exit(0);
    }

    int iarg=1;
    char* mat_filename = argv[iarg++];
    char* rhs_filename = argv[iarg++];

    printf("mat file:%s\n", mat_filename);
    printf("rhs file:%s\n", rhs_filename);

    int *nnz_rows;
    int M = 0, N = 0, nnz = 0, *I = NULL, *J = NULL;
    unsigned long long long_nnz=0;
    float *val = NULL;
    float *x;
    float *rhs;
    int *d_col, *d_row;
    float *d_val, *d_x, *d_v;

    float *d_rhs, *d_u;
    float alpha, beta;

    int header[3];

    FILE* file1;
    FILE* rhs_file;

    // This will pick the best possible CUDA capable device
    cudaDeviceProp deviceProp;
    int devID = findCudaDevice(argc, (const char **)argv);

    if (devID < 0)
    {
        printf("exiting...\n");
        exit(EXIT_SUCCESS);
    }

    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));

    // Statistics about the GPU device
    printf("> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
           deviceProp.multiProcessorCount, deviceProp.major, deviceProp.minor);

    int version = (deviceProp.major * 0x10 + deviceProp.minor);

    if (version < 0x11)
    {
        printf("spmv: requires a minimum CUDA compute 1.1 capability\n");

        exit(EXIT_SUCCESS);
    }

    file1 = fopen(mat_filename, "rb");
    rhs_file = fopen(rhs_filename, "rb");

    fread(header, sizeof(int), 3, file1);
    fread(&long_nnz, sizeof(unsigned long long), 1, file1);

    M    = header[1]; N = header[2];

    printf("M=%d, N=%d\n", M, N);
    printf("long_nnz=%llu\n", long_nnz);
    nnz = long_nnz;

    /* Generate a random tridiagonal symmetric matrix in CSR format */
    //M = N = 1048576;
    //nnz = (N-2)*3 + 4;
    //I = (int *)malloc(sizeof(int)*(M+1));
    I = (int *)malloc(sizeof(int)*(M+1));
    nnz_rows = (int *)malloc(sizeof(int)*(M));
    J = (int *)malloc(sizeof(int)*nnz);
    val = (float *)malloc(sizeof(float)*nnz);

    x = (float *)malloc(sizeof(float)*N);
    rhs = (float *)malloc(sizeof(float)*M);

    fread(header, sizeof(int), 2, rhs_file);

    printf("numrow=%d\n", header[1]);
    fread(rhs, sizeof(float), M, rhs_file);
    fclose(rhs_file);

    for (int i = 0; i < N; i++)
    {
        x[i] = 0.0f;
    }

    fread(nnz_rows, sizeof(int), M, file1);
    fread(J, sizeof(int), (nnz), file1);
    fread(val, sizeof(float), (nnz), file1);
    fclose(file1);

    memset(I,0,sizeof(int)*(M+1));

    for(int i=0;i<M;++i)
    {
        I[i+1] = I[i]+nnz_rows[i];
    }

    /* Get handle to the CUBLAS context */
    cublasHandle_t cublasHandle = 0;
    cublasStatus_t cublasStatus;
    cublasStatus = cublasCreate(&cublasHandle);

    checkCudaErrors(cublasStatus);

    /* Get handle to the CUSPARSE context */
    cusparseHandle_t cusparseHandle = 0;
    cusparseStatus_t cusparseStatus;
    cusparseStatus = cusparseCreate(&cusparseHandle);

    checkCudaErrors(cusparseStatus);

    cusparseMatDescr_t descr = 0;
    cusparseStatus = cusparseCreateMatDescr(&descr);

    checkCudaErrors(cusparseStatus);

    checkCudaErrors(cublasStatus);

    cusparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ZERO);

    checkCudaErrors(cudaMalloc((void **)&d_col , nnz*sizeof(int)));
    checkCudaErrors(cudaMalloc((void **)&d_row , (M+1)*sizeof(int)));
    checkCudaErrors(cudaMalloc((void **)&d_val , nnz*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_rhs , M*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_u   , M*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_v   , N*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_x   , N*sizeof(float)));

    checkCudaErrors(cudaMemcpy(d_col , J   , nnz*sizeof(int)   , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_row , I   , (M+1)*sizeof(int) , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_val , val , nnz*sizeof(float) , cudaMemcpyHostToDevice));

    checkCudaErrors(cudaMemcpy(d_x   , x   , N*sizeof(float)   , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_rhs , rhs , M*sizeof(float)   , cudaMemcpyHostToDevice));

    free(I);
    free(J);
    free(val);
    free(nnz_rows);

    alpha = 1.0f;
    beta = 0.0f;

    float a, b;

    //cublasStatus = cublasScopy(cublasHandle, M, d_rhs, 1, d_u, 1);
    checkCudaErrors(cudaMemcpy(d_u , d_rhs, M*sizeof(float)   , cudaMemcpyDeviceToDevice));
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    cublasStatus = cublasSnrm2(cublasHandle, M, d_u, 1, &b);
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    //printf("|b|=%12.12f\n", b);
    beta = b!=0.0f?1.0f/b:1.0f;
    //printf("1.0/|b|=%12.12f\n", beta);

    cublasStatus = cublasSscal(cublasHandle, M, &beta, d_u, 1);
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
        M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

    //cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
    //    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_v, &beta, d_u);

    cudaDeviceSynchronize();
    checkCudaErrors(cusparseStatus);

    cublasStatus = cublasSnrm2(cublasHandle, N, d_v, 1, &a);
    cudaDeviceSynchronize();
    //printf("0 a = %12.12f \n", a);
    checkCudaErrors(cublasStatus);

    cusparseDestroy(cusparseHandle);
    cublasDestroy(cublasHandle);

    cudaFree(d_col);
    cudaFree(d_row);
    cudaFree(d_val);
    cudaFree(d_x);
    cudaFree(d_u);
    cudaFree(d_v);
    cudaFree(d_rhs);
    free(x);
    free(rhs);

    exit(EXIT_SUCCESS);
}

If you want me to look at it, a few suggestions:

  1. remove any dependencies on external files. Generate whatever data you need programmatically.
  2. make the program self-contained and designed to run at the problem size that produces the problem.
  3. remove anything that isn’t necessary (e.g. extra cublas calls) to demonstrate the problem.
  1. is very difficult for me. Any chance I can upload the data somewhere so you can download it?

  2. is doable.

  3. is doable.

When I generate the matrix programmatically, there is no error anymore.

I suspect this problem is some special case.

Any suggestions?

It looks like you are using CUDA 8.0.

I would try to see if the problem still occurs with CUDA 9.2

I would also check that the matrix is in proper CSR format.

I tried with CUDA 9.2; it does not work either.

I converted my matrix to CSC format, which is equivalent to a transposed matrix.

Calling cusparseScsrmv with CUSPARSE_OPERATION_NON_TRANSPOSE mode on that CSC data works fine.

Can you elaborate on how I can verify my CSR matrix?

I use the same matrix with CPU code that calls PETSc and MKL, and it works fine.

I don’t have a ready test or test methodology to offer for verifying the CSR format. The data has to follow a certain set of rules for organization, which are described in a number of places, one of which is the cusparse manual.

The verification process involves confirming that the supplied CSR matrix follows these rules.

If you like, you can file a bug at developer.nvidia.com

They will eventually ask for a full test case to reproduce the bug. At that point, since it seems to only happen with a particular matrix, you will need to provide that matrix somehow. If it can be uploaded, they will provide a methodology to upload the matrix for test.

When I tried to submit a bug at developer.nvidia.com, I got this error message:

“An error occurred while processing your request.”

Do I need to purchase some membership in order to submit a bug report?

No, but the web entry form is aggressive about attempting to prevent malicious usage.

My suggestion is to start by filing a very simple bug with a minimum of information, then after the bug is logged into the system, you should be able to add information to it.

Once you get a bug logged, if you want to provide the bug number assigned by the system, I can help get the bug organized if needed.

OK, with some minimal text, it works.

https://developer.nvidia.com/nvidia_bug/2248744