Hello,
I tried to parallelize a sparse matrix-vector multiplication (CRS format) with the PGI Accelerator. For a very small test case (12x12) the kernel launches sometimes (!), but not always. For bigger test cases it never works. In both cases the error message is:
call to cuMemcpyDtoH returned error 700: Launch failed
typedef double floatType;

struct MatrixCRS {
    int n;
    int nnz;
    int* ptr;
    int* index;
    floatType* value;
};

// y <- A*x
void matvec(const struct MatrixCRS* restrict A, const floatType* restrict x, floatType* restrict y){
    int i,j;
    floatType* restrict value=A->value;
    int* restrict index = A->index;
    int* restrict ptr=A->ptr;
    const int nnz=A->nnz;
    const int n=A->n;
#pragma acc region copyin(value[0:nnz-1], ptr[0:n], index[0:n], x[0:n-1]), copyout(y[0:n-1])
    {
        for(i=0; i<A->n; i++){
            y[i]=0;
            for(j=ptr[i]; j<ptr[i+1]; j++){
                y[i]+=value[j]*x[index[j]];
            }
        }
    } // end of acc region
}
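For comparing the accelerator result against a known-good answer, a host-only version of the same loop can help. The following is a minimal sketch, not the code that was compiled above: matvec_ref is a hypothetical name, the matrix is passed as raw arrays instead of struct MatrixCRS, and 0-based row offsets are assumed.

```c
#include <assert.h>

/* Host-only reference: y <- A*x for a CRS matrix given as raw arrays.
   Assumption: ptr has n+1 entries and starts at 0 (0-based offsets);
   index and value are read at positions 0 .. ptr[n]-1. */
void matvec_ref(int n, const int *ptr, const int *index,
                const double *value, const double *x, double *y)
{
    assert(ptr[0] == 0);               /* assumed 0-based CRS offsets */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;              /* per-row accumulator */
        for (int j = ptr[i]; j < ptr[i + 1]; j++) {
            sum += value[j] * x[index[j]];
        }
        y[i] = sum;
    }
}
```

Running this on the same input as the accelerator version and comparing the two y vectors shows whether a kernel that does launch also produces correct values.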
I'm not sure whether I did the copy to the device correctly. The lengths of the arrays are:
value: nnz
ptr: n+1
index: n+1
x: n
y: n
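The inner loop reads index[j] for every j from ptr[0] up to ptr[n]-1, so index is accessed at as many positions as value. Below is a minimal sketch of a host-side check that could run before the acc region. It assumes the usual CRS layout (ptr with n+1 entries, index and value with nnz entries each, 0-based); crs_is_consistent and crs_require_consistent are hypothetical helper names. If index does hold nnz entries, a clause such as index[0:n] would transfer too few elements whenever nnz is larger than n+1.

```c
#include <assert.h>

/* Returns 1 if the CRS arrays are mutually consistent, 0 otherwise.
   Assumptions: 0-based offsets, ptr has n+1 entries, index has nnz entries. */
int crs_is_consistent(int n, int nnz, const int *ptr, const int *index)
{
    if (n < 0 || nnz < 0) return 0;
    if (ptr[0] != 0 || ptr[n] != nnz) return 0;   /* offsets must span all nonzeros */
    for (int i = 0; i < n; i++) {
        if (ptr[i] > ptr[i + 1]) return 0;        /* row offsets must not decrease */
    }
    for (int j = 0; j < nnz; j++) {
        if (index[j] < 0 || index[j] >= n) return 0;  /* column must address x[0..n-1] */
    }
    return 1;
}

/* Debug-time guard: aborts if the matrix arrays are inconsistent. */
void crs_require_consistent(int n, int nnz, const int *ptr, const int *index)
{
    assert(crs_is_consistent(n, nnz, ptr, index));
}
```

If the check passes on the host, every index[j] and value[j] the kernel touches lies in the range 0 .. nnz-1, which is the range the data clauses would need to cover.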
Is there a mistake in my code, or is this an issue with the PGI compiler? I tried versions 10.5 and 10.6 on a Tesla T10 processor. The compiler feedback looks like this:
$ pgcc -fastsse -DDEBUG -ta=nvidia,3.0,cc13 -Minfo -g -c solver.c
matvec:
     40, Generating copyin(ptr[:n])
         Generating copyout(y[:n-1])
         Generating copyin(value[:nnz-1])
         Generating copyin(x[:n-1])
         Generating copyin(index[:n])
         Generating compute capability 1.3 binary
     43, Loop is parallelizable
         Accelerator kernel generated
         43, #pragma acc for parallel, vector(256)
             Cached references to size [257] block of 'ptr'
             Using register for 'y'
             CC 1.3 : 19 registers; 1052 shared, 92 constant, 0 local memory bytes; 75 occupancy
     45, Complex loop carried dependence of 'y' prevents parallelization
         Loop carried reuse of 'y' prevents parallelization
         Inner sequential loop scheduled on accelerator
Thanks for your help.
Kind regards,
Tim