cusparseScsrmv transpose mode is not working

Hi,

I am trying to use cusparseScsrmv for matrix-vector multiplication.

The CUSPARSE_OPERATION_NON_TRANSPOSE mode works fine. However, with CUSPARSE_OPERATION_TRANSPOSE mode, although cusparseScsrmv returns a success status, my following cublas call, cublasSnrm2, fails with the error message “code=13(CUBLAS_STATUS_EXECUTION_FAILED) “cublasStatus””

If I transpose the matrix myself and call cusparseScsrmv with CUSPARSE_OPERATION_NON_TRANSPOSE mode, it works fine.

Very strange behavior. Has anyone had a similar experience?

Here is the part of the code that has the problem.

    cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
        N, M, nnz, &alpha, descr, d_csc_val, d_csc_col, d_csc_row, d_u, &beta, d_v);

    //cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
    //    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

    cudaDeviceSynchronize();
    checkCudaErrors(cusparseStatus);

    cublasStatus = cublasSnrm2(cublasHandle, N, d_v, 1, &a);
    cudaDeviceSynchronize();
    //printf("0 a = %12.12f \n", a);
    checkCudaErrors(cublasStatus);

Maybe you simply have a formatting error in your transposed matrix.

Such errors may not be discovered until the cusparse algorithm actually runs. The cusparse function call may involve asynchronous activity, meaning that the function can return a status code (success, in your case) before the algorithm has completed the matrix-vector multiplication. You can also sometimes get a sense of this by running your code under cuda-memcheck. If an illegal access error is reported, it is an indication that you may be feeding incorrect data into the cusparse call.

It’s impossible to be sure about that without a complete test case.
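The checking pattern I'm describing can be sketched like this (a fragment reusing the variable names from your snippet; checking the cusparse status alone is not enough, because a kernel launched by the library can fail after the call has already returned):

```cuda
// Launch the SpMV; "success" here only means the work was accepted.
cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);
checkCudaErrors(cusparseStatus);

// Force completion and ask the runtime whether any kernel actually failed.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    fprintf(stderr, "asynchronous kernel failure after csrmv: %s\n",
            cudaGetErrorString(err));
}
```

Once such an error occurs, the CUDA context is corrupted, so later unrelated calls (cublas, cudaFree) will also report failures.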

The transposed-matrix call has no problem if I transpose the matrix explicitly myself:
cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
N, M, nnz, &alpha, descr, d_csc_val, d_csc_col, d_csc_row, d_u, &beta, d_v);

The call with CUSPARSE_OPERATION_TRANSPOSE on the original matrix is the one that has the problem:
//cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
// M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

The strange part is that cusparseStatus returns success even after I call cudaDeviceSynchronize(), but the code then fails at the cublas function.

Let me try cuda-memcheck.

Here is the error message; it looks like it hit a cudaFree problem:

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (110,0,0) in block (130,0,0)
========= Address 0xfff70650 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (109,0,0) in block (130,0,0)
========= Address 0xfff70094 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x000007a8 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (108,0,0) in block (130,0,0)
========= Address 0xfff70090 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Invalid shared read of size 4
========= at 0x00000a68 in void csrMvT_hyb_kernel<float, float, float, int=7, int=2, int=8, int=5, int=0>(cusparseCsrMvParams<float, float, float>, int*)
========= by thread (126,0,0) in block (75,0,0)
========= Address 0xfff70650 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x2043e5]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x359781]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x377323]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17d592]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x381a06]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17c5e1]
========= Host Frame:/omega2/env/cuda-8.0/lib64/libcusparse.so.8.0 [0x17e601]
========= Host Frame:./spmv [0x3e90]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:./spmv [0x38206]
========= Host Frame:./spmv [0x3e9b]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaFuncGetAttributes.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x3b9d79]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x74788]
========= Host Frame:/omega2/baselines/2018.1-192/linux_64/lib/libcublas.so.8.0 [0x77e7f]
========= Host Frame:./spmv [0x3ed0]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

CUDA error at spmv.cu:316 code=13(CUBLAS_STATUS_EXECUTION_FAILED) “cublasStatus”
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2efd93]
========= Host Frame:./spmv [0x38206]
========= Host Frame:./spmv [0x3ed7]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./spmv [0x3f91]

========= ERROR SUMMARY: 8 errors

One of the kernels in csrmv is failing. The cudaFree error is simply reflecting that some previous kernel execution error occurred. The problem is not due to cudaFree itself. The subsequent CUBLAS error is not really arising due to any CUBLAS activity, but is reflecting the fact that a previous kernel execution error occurred. The situation is approximately as I described. It may be due to misformatted input. It might be a bug in cusparse. I doubt further conclusions could be reached without a complete test case.

I tried with a small matrix and it works fine; cuda-memcheck doesn’t report any errors. But if I test with a bigger matrix (4 GB), it fails with those errors.

Is there anywhere I can upload the test case?

If the problem is as you describe, you should be able to demonstrate it in less than 100 lines of code. In that case, you can paste your code in this thread. Condense your code down to only what is necessary to demonstrate the problem. Be sure to use the code formatting option in the toolbar above your edit area.

Alternatively if you simply wish to report a bug, you can use the bug reporting portal at developer.nvidia.com

Here is the code. I think the input data size matters.

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <iostream>  // needed for the std::cout usage message below

/* Using updated (v2) interfaces to cublas */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

// Utilities and system includes
#include <helper_functions.h>  // helper for shared functions common to CUDA Samples
#include <helper_cuda.h>       // helper function CUDA error checking and initialization


int main(int argc, char **argv)
{

    if (argc<3)
    {
        std::cout << "mat_file, rhs_file" << std::endl;
        exit(0);
    }

    int iarg=1;
    char* mat_filename = argv[iarg++];
    char* rhs_filename = argv[iarg++];

    printf("mat file:%s\n", mat_filename);
    printf("rhs file:%s\n", rhs_filename);

    int *nnz_rows;
    int M = 0, N = 0, nnz = 0, *I = NULL, *J = NULL;
    unsigned long long long_nnz=0;
    float *val = NULL;
    float *x;
    float *rhs;
    int *d_col, *d_row;
    float *d_val, *d_x, *d_v;

    float *d_rhs, *d_u;
    float alpha, beta;

    int header[3];

    FILE* file1;
    FILE* rhs_file;

    // This will pick the best possible CUDA capable device
    cudaDeviceProp deviceProp;
    int devID = findCudaDevice(argc, (const char **)argv);

    if (devID < 0)
    {
        printf("exiting...\n");
        exit(EXIT_SUCCESS);
    }

    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));

    // Statistics about the GPU device
    printf("> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
           deviceProp.multiProcessorCount, deviceProp.major, deviceProp.minor);

    int version = (deviceProp.major * 0x10 + deviceProp.minor);

    if (version < 0x11)
    {
        printf("spmv: requires a minimum CUDA compute 1.1 capability\n");

        exit(EXIT_SUCCESS);
    }

    file1 = fopen(mat_filename, "rb");
    rhs_file = fopen(rhs_filename, "rb");

    fread(header, sizeof(int), 3, file1);
    fread(&long_nnz, sizeof(unsigned long long), 1, file1);

    M    = header[1]; N = header[2];

    printf("M=%d, N=%d\n", M, N);
    printf("long_nnz=%llu\n", long_nnz);
    nnz = long_nnz;

    /* Generate a random tridiagonal symmetric matrix in CSR format */
    //M = N = 1048576;
    //nnz = (N-2)*3 + 4;
    //I = (int *)malloc(sizeof(int)*(M+1));
    I = (int *)malloc(sizeof(int)*(M+1));
    nnz_rows = (int *)malloc(sizeof(int)*(M));
    J = (int *)malloc(sizeof(int)*nnz);
    val = (float *)malloc(sizeof(float)*nnz);

    x = (float *)malloc(sizeof(float)*N);
    rhs = (float *)malloc(sizeof(float)*M);

    fread(header, sizeof(int), 2, rhs_file);

    printf("numrow=%d\n", header[1]);
    fread(rhs, sizeof(float), M, rhs_file);
    fclose(rhs_file);

    for (int i = 0; i < N; i++)
    {
        x[i] = 0.0f;
    }

    fread(nnz_rows, sizeof(int), M, file1);
    fread(J, sizeof(int), (nnz), file1);
    fread(val, sizeof(float), (nnz), file1);
    fclose(file1);

    memset(I,0,sizeof(int)*(M+1));

    for(int i=0;i<M;++i)
    {
        I[i+1] = I[i]+nnz_rows[i];
    }

    /* Get handle to the CUBLAS context */
    cublasHandle_t cublasHandle = 0;
    cublasStatus_t cublasStatus;
    cublasStatus = cublasCreate(&cublasHandle);

    checkCudaErrors(cublasStatus);

    /* Get handle to the CUSPARSE context */
    cusparseHandle_t cusparseHandle = 0;
    cusparseStatus_t cusparseStatus;
    cusparseStatus = cusparseCreate(&cusparseHandle);

    checkCudaErrors(cusparseStatus);

    cusparseMatDescr_t descr = 0;
    cusparseStatus = cusparseCreateMatDescr(&descr);

    checkCudaErrors(cusparseStatus);

    checkCudaErrors(cublasStatus);

    cusparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ZERO);

    checkCudaErrors(cudaMalloc((void **)&d_col , nnz*sizeof(int)));
    checkCudaErrors(cudaMalloc((void **)&d_row , (M+1)*sizeof(int)));
    checkCudaErrors(cudaMalloc((void **)&d_val , nnz*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_rhs , M*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_u   , M*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_v   , N*sizeof(float)));
    checkCudaErrors(cudaMalloc((void **)&d_x   , N*sizeof(float)));

    checkCudaErrors(cudaMemcpy(d_col , J   , nnz*sizeof(int)   , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_row , I   , (M+1)*sizeof(int) , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_val , val , nnz*sizeof(float) , cudaMemcpyHostToDevice));

    checkCudaErrors(cudaMemcpy(d_x   , x   , N*sizeof(float)   , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_rhs , rhs , M*sizeof(float)   , cudaMemcpyHostToDevice));

    free(I);
    free(J);
    free(val);
    free(nnz_rows);

    alpha = 1.0f;
    beta = 0.0f;

    float a, b;

    //cublasStatus = cublasScopy(cublasHandle, M, d_rhs, 1, d_u, 1);
    checkCudaErrors(cudaMemcpy(d_u , d_rhs, M*sizeof(float)   , cudaMemcpyDeviceToDevice));
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    cublasStatus = cublasSnrm2(cublasHandle, M, d_u, 1, &b);
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    //printf("|b|=%12.12f\n", b);
    beta = b!=0.0f?1.0f/b:1.0f;
    //printf("1.0/|b|=%12.12f\n", beta);

    cublasStatus = cublasSscal(cublasHandle, M, &beta, d_u, 1);
    cudaDeviceSynchronize();
    checkCudaErrors(cublasStatus);

    cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE,
        M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_u, &beta, d_v);

    //cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
    //    M, N, nnz, &alpha, descr, d_val, d_row, d_col, d_v, &beta, d_u);

    cudaDeviceSynchronize();
    checkCudaErrors(cusparseStatus);

    cublasStatus = cublasSnrm2(cublasHandle, N, d_v, 1, &a);
    cudaDeviceSynchronize();
    //printf("0 a = %12.12f \n", a);
    checkCudaErrors(cublasStatus);

    cusparseDestroy(cusparseHandle);
    cublasDestroy(cublasHandle);

    cudaFree(d_col);
    cudaFree(d_row);
    cudaFree(d_val);
    cudaFree(d_x);
    cudaFree(d_u);
    cudaFree(d_v);
    cudaFree(d_rhs);
    free(x);
    free(rhs);

    exit(EXIT_SUCCESS);
}

If you want me to look at it, a few suggestions:

  1. remove any dependencies on external files. Generate whatever data you need programmatically.
  2. make the program self-contained and designed to run at the problem size that produces the problem.
  3. remove anything that isn’t necessary (e.g. extra cublas calls) to demonstrate the problem.
  1. is very difficult for me. Any chance I can upload the data somewhere so you can download it?

  2. is doable.

  3. is doable.

When I generate the matrix programmatically, there is no error anymore.

I suspect this problem is some special case.

Any suggestions?

It looks like you are using CUDA 8.0.

I would try to see if the problem still occurs with CUDA 9.2

I would also check that the matrix is in proper CSR format.

I tried with CUDA 9.2; it does not work either.

I converted my matrix to CSC format, which is equivalent to a transposed matrix.

Calling cusparseScsrmv with CUSPARSE_OPERATION_NON_TRANSPOSE mode on that CSC data works fine.

Can you elaborate on how I can verify my CSR matrix?

I use the same matrix with CPU code that calls PETSc and MKL, and it works fine.

I don’t have a ready test or test methodology to offer for verifying the CSR format. The data has to follow a certain set of rules for organization, which are described in a number of places, one of which is the cusparse manual.

The verification process involves confirming that the supplied CSR matrix follows these rules.

If you like, you can file a bug at developer.nvidia.com

They will eventually ask for a full test case to reproduce the bug. At that point, since it seems to only happen with a particular matrix, you will need to provide that matrix somehow. If it can be uploaded, they will provide a methodology to upload the matrix for test.

When I tried to submit a bug at developer.nvidia.com, I got this error message:

“An error occurred while processing your request.”

Do I need to purchase some membership in order to submit a bug report?

No, but the web entry form is aggressive about attempting to prevent malicious usage.

My suggestion is to start by filing a very simple bug with a minimum of information, then after the bug is logged into the system, you should be able to add information to it.

Once you get a bug logged, if you want to provide the bug number assigned by the system, I can help get the bug organized if needed.

OK, with some minimal text, it works.

https://developer.nvidia.com/nvidia_bug/2248744