Hello,

Can somebody please tell me how to parallelize the inner loop using the kernels construct?

The compiler keeps reporting that the inner loop is parallelizable, but it does not actually parallelize it.

```
inline void matvec(const struct MatrixCRS* A, const float *restrict g, float *restrict y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    int *restrict ptr = A->ptr;
    int *restrict index = A->index;
    float *restrict value = A->value;
    #pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
        #pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0f;
            #pragma acc loop independent reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}
```

The compiler feedback is:

```
matvec:
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 23 registers; 0 shared, 100 constant, 0 local memory bytes
     60, Loop is parallelizable
```

If I read the compiler feedback correctly, the following version, which uses the parallel construct instead, does parallelize the inner loop as well:

```
inline void matvec(const struct MatrixCRS* A, const floatType *restrict g, floatType *restrict y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    int *restrict ptr = A->ptr;
    int *restrict index = A->index;
    floatType *restrict value = A->value;
    #pragma acc parallel present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n]) vector_length(32)
    {
        #pragma acc loop gang
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
            #pragma acc loop vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}
```

This is the corresponding compiler feedback:

```
matvec:
     54, Accelerator kernel generated
         54, CC 2.0 : 20 registers; 32 shared, 92 constant, 0 local memory bytes
         57, #pragma acc loop gang /* blockIdx.x */
         60, #pragma acc loop vector(32) /* threadIdx.x */
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     60, Loop is parallelizable
```
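
For comparison, here is a minimal sketch of a kernels variant that requests the same gang/vector mapping explicitly, rather than leaving the schedule to the compiler. It has not been confirmed to change the feedback above, so treat it as something to try. Assumptions: the `MatrixCRS` struct below is a hypothetical stand-in containing only the fields used in the code above, the function name `matvec_sketch` is made up, and `copyin`/`copyout` replace `present` so that the function does not depend on an enclosing data region. As in the code above, `g` is taken to have `n` elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the MatrixCRS struct; only the fields
   used in the matvec code above are assumed. */
struct MatrixCRS {
    int n;        /* number of rows */
    int nnz;      /* number of stored nonzeros */
    int *ptr;     /* row start offsets, length n+1 */
    int *index;   /* column index of each nonzero, length nnz */
    float *value; /* value of each nonzero, length nnz */
};

/* Computes y = A*g for a CRS matrix. Without an OpenACC compiler the
   pragmas are ignored and the loops simply run serially. */
void matvec_sketch(const struct MatrixCRS *A, const float *restrict g,
                   float *restrict y)
{
    assert(A != NULL && g != NULL && y != NULL);
    int n = A->n;
    int nnz = A->nnz;
    const int *restrict ptr = A->ptr;
    const int *restrict index = A->index;
    const float *restrict value = A->value;

    /* copyin/copyout instead of present: an assumption of this sketch. */
    #pragma acc kernels copyin(ptr[0:n+1], index[0:nnz], value[0:nnz], g[0:n]) copyout(y[0:n])
    {
        /* Explicitly map rows to gangs ... */
        #pragma acc loop independent gang
        for (int i = 0; i < n; i++) {
            float tmp = 0.0f;
            /* ... and request 32 vector lanes for the per-row dot product. */
            #pragma acc loop vector(32) reduction(+:tmp)
            for (int j = ptr[i]; j < ptr[i + 1]; j++) {
                tmp += value[j] * g[index[j]];
            }
            y[i] = tmp;
        }
    }
}
```

If the feedback for line 60 then shows a `vector(32)` schedule instead of only "Loop is parallelizable", the inner loop is being mapped to threads in the same way as in the parallel version.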

Best,

Paul