Launch fails for sparse matrix-vector multiplication


I tried to parallelize a sparse matrix-vector multiplication (CRS format) with the PGI Accelerator. For a very small test case (12x12) the kernel launches sometimes (!), but not always. For bigger test cases it never works. In both cases the error message is:

call to cuMemcpyDtoH returned error 700: Launch failed

typedef double floatType;

struct MatrixCRS{
  int n;
  int nnz;
  int* ptr; 
  int* index;
  floatType* value;
};

// y <- A*x
void matvec(const struct MatrixCRS restrict* A, const floatType restrict* x, floatType restrict* y){
  int i,j;
  floatType restrict* value=A->value;
  int restrict* index = A->index;
  int restrict* ptr=A->ptr;
  const int nnz=A->nnz;
  const int n=A->n;
  #pragma acc region copyin(value[0:nnz-1], ptr[0:n], index[0:n], x[0:n-1]), copyout(y[0:n-1])
  {
    for(i=0; i<A->n; i++){
      y[i]=0;
      for(j=ptr[i]; j<ptr[i+1]; j++){
        y[i]+=value[j]*x[index[j]];
      }
    }
  }//parallel region
}

I'm not sure whether I did the copy to the device correctly. The lengths of the arrays are:
value: nnz
ptr: n+1
index: n+1
x: n
y: n

Is there a mistake in my code, or is this an issue with the PGI compiler? I tried versions 10.5 and 10.6 on a Tesla T10 processor. The compiler feedback looks like this:

$ pgcc -fastsse -DDEBUG -ta=nvidia,3.0,cc13 -Minfo -g -c solver.c
     40, Generating copyin(ptr[:n])
         Generating copyout(y[:n-1])
         Generating copyin(value[:nnz-1])
         Generating copyin(x[:n-1])
         Generating copyin(index[:n])
         Generating compute capability 1.3 binary
     43, Loop is parallelizable
         Accelerator kernel generated
         43, #pragma acc for parallel, vector(256)
             Cached references to size [257] block of 'ptr'
             Using register for 'y'
             CC 1.3 : 19 registers; 1052 shared, 92 constant, 0 local memory bytes; 75 occupancy
     45, Complex loop carried dependence of 'y' prevents parallelization
         Loop carried reuse of 'y' prevents parallelization
         Inner sequential loop scheduled on accelerator

Thanks for your help.

Kind regards,

Does nobody know why this happens?

Hi Tim,

Sorry I missed your post earlier.

I don’t see anything obvious. The error “call to cuMemcpyDtoH returned error 700: Launch failed” typically means that your kernel crashed during execution. So my best guess is that there is a memory access violation, for example if one of the ptr[i] values is greater than the size of value or index. I’ve seen several programs with array bounds problems like this that ‘work’ on a CPU but crash on a GPU. Try compiling the CPU code with bounds checking enabled (-Mbounds) to see if anything shows up.
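A check of this kind can also be done by hand on the CPU before the data goes to the device. The following is a minimal sketch, assuming the usual CRS layout in which ptr has n+1 entries and both index and value have nnz entries. The helper name check_crs and its return codes are made up for illustration and are not part of the original program.

```c
#include <stdio.h>

typedef double floatType;

struct MatrixCRS {
  int n;            /* number of rows (square matrix assumed) */
  int nnz;          /* number of stored nonzeros */
  int *ptr;         /* row pointers, assumed length n+1 */
  int *index;       /* column indices, assumed length nnz */
  floatType *value; /* nonzero values, assumed length nnz */
};

/* Hypothetical helper: returns 0 if the CRS arrays are consistent,
   otherwise a nonzero code naming the first kind of problem found. */
int check_crs(const struct MatrixCRS *A) {
  int i, j;

  /* The row pointers must span exactly the nnz stored entries. */
  if (A->ptr[0] != 0 || A->ptr[A->n] != A->nnz) return 1;

  /* Row pointers must be non-decreasing, so every j stays in [0, nnz). */
  for (i = 0; i < A->n; i++) {
    if (A->ptr[i] > A->ptr[i + 1]) return 2;
  }

  /* Every column index is used to address x, so it must be in [0, n). */
  for (j = 0; j < A->nnz; j++) {
    if (A->index[j] < 0 || A->index[j] >= A->n) return 3;
  }
  return 0;
}
```

If this returns 0, every access to value[j], index[j] and x[index[j]] in the kernel stays within nnz, nnz and n elements respectively, so the copyin ranges have to cover at least that much.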

If this isn’t it, can you please post a driver program that I can use to recreate the problem?



Sorry, I can't see any memory access violations. Compiling with -Mbounds does not change anything. I built a small driver, but I don't want to post it here. Is it possible to send it by email?



I sent the test driver to:

Could you look into it? Thank you for the support.

Hi Tim,

Customer support forwarded me your code. There are a number of coding errors, and I’ll send you back my changes to your ‘main.c’ file.

The above error is caused by ‘index’ being too small. The values of ‘j’ go from 0 to nnz-1, not 0 to n-1. Hence, the copyin clause for index should be ‘index[0:nnz-1]’.
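A minimal sketch of the routine with that one clause changed is shown here. It is not the patched ‘main.c’. The loop body is the standard CRS product written with a local accumulator, which is an assumption; the pragma keeps the PGI Accelerator form from the original post, where array ranges are written as inclusive lower:upper bounds.

```c
typedef double floatType;

struct MatrixCRS {
  int n;            /* number of rows */
  int nnz;          /* number of stored nonzeros */
  int *ptr;         /* row pointers, n+1 entries */
  int *index;       /* column indices, nnz entries */
  floatType *value; /* nonzero values, nnz entries */
};

/* y <- A*x */
void matvec(const struct MatrixCRS *A, const floatType *restrict x,
            floatType *restrict y) {
  int i, j;
  const floatType *restrict value = A->value;
  const int *restrict index = A->index;
  const int *restrict ptr = A->ptr;
  const int nnz = A->nnz;
  const int n = A->n;

  /* index is addressed with j, which runs over all nonzeros,
     so its copied range is 0..nnz-1 (it was 0..n before). */
  #pragma acc region copyin(value[0:nnz-1], index[0:nnz-1], ptr[0:n], x[0:n-1]), copyout(y[0:n-1])
  {
    for (i = 0; i < n; i++) {
      floatType sum = 0; /* per-row accumulator (assumed loop body) */
      for (j = ptr[i]; j < ptr[i + 1]; j++) {
        sum += value[j] * x[index[j]];
      }
      y[i] = sum;
    }
  }
}
```

A compiler without accelerator support ignores the pragma, so the same function can be run on the CPU to compare results against the GPU version.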

Hope this helps,

Hi Mat,

Thanks. I have sent you another email. Unfortunately, this solves the problem only for the small test cases. For the big data sample I still get the ‘Launch failed’ error…

Hi Tim,

This one looks like a compiler error in 10.6. It runs fine for me when using 10.5 and the 10.8 release candidate. Unfortunately, there were a number of issues with 10.6, so if you can move back to 10.5 or wait a few days until 10.8 is released, that would be great.


Hi Mat,

Thanks for your quick help. I will go back to 10.5 until the new release.