Hello,
Can somebody please tell me how to parallelize the inner loop using the kernels construct?
The compiler keeps reporting that the inner loop is parallelizable, but it does not actually parallelize it.
inline void matvec(const struct MatrixCRS* A, const float* restrict g, float* restrict y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    int* restrict ptr = A->ptr;
    int* restrict index = A->index;
    float* restrict value = A->value;
    #pragma acc kernels present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n])
    {
        #pragma acc loop independent
        for(i=0; i<n; i++){
            float tmp = 0.0;
            #pragma acc loop independent reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}
The output is:
matvec:
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     57, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         Cached references to size [(x+1)] block of 'ptr'
             CC 2.0 : 23 registers; 0 shared, 100 constant, 0 local memory bytes
     60, Loop is parallelizable
If I read the compiler feedback correctly, the following version, which uses the parallel construct instead, does parallelize the inner loop as well:
inline void matvec(const struct MatrixCRS* A, const floatType* restrict g, floatType* restrict y){
    int i,j;
    int n = A->n;
    int nnz = A->nnz;
    int* restrict ptr = A->ptr;
    int* restrict index = A->index;
    floatType* restrict value = A->value;
    #pragma acc parallel present(ptr[0:n+1],index[0:nnz],value[0:nnz], g[0:n], y[0:n]) vector_length(32)
    {
        #pragma acc loop gang
        for(i=0; i<n; i++){
            floatType tmp = 0.0;
            #pragma acc loop vector reduction(+:tmp)
            for(j=ptr[i]; j<ptr[i+1]; j++){
                tmp+=value[j]*g[index[j]];
            }
            y[i]=tmp;
        }
    }
}
This is the corresponding compiler feedback:
matvec:
     54, Accelerator kernel generated
         54, CC 2.0 : 20 registers; 32 shared, 92 constant, 0 local memory bytes
         57, #pragma acc loop gang /* blockIdx.x */
         60, #pragma acc loop vector(32) /* threadIdx.x */
     54, Generating present(y[0:n])
         Generating present(g[0:n])
         Generating present(value[0:nnz])
         Generating present(index[0:nnz])
         Generating present(ptr[0:n+1])
         Generating compute capability 2.0 binary
     60, Loop is parallelizable
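In case it helps with reproducing this, here is a minimal serial sketch of the same loop nest that can serve as a host reference for checking the accelerator results. It assumes that MatrixCRS holds only the fields accessed in matvec above (n, nnz, ptr, index, value); the real struct may contain more. The name matvec_ref is made up for illustration, and plain float is used in place of floatType.

```c
#include <assert.h>

/* Assumed layout: field names are taken from the accesses in matvec;
   the actual definition of MatrixCRS may differ. */
struct MatrixCRS {
    int n;        /* number of rows */
    int nnz;      /* number of stored nonzeros */
    int *ptr;     /* row i occupies entries ptr[i] .. ptr[i+1]-1 */
    int *index;   /* column index of each stored entry */
    float *value; /* value of each stored entry */
};

/* Hypothetical serial reference (not part of the original code): y = A * g.
   Every row i is independent of the others; within a row, the inner loop
   is a sum reduction into tmp. */
static void matvec_ref(const struct MatrixCRS *A, const float *g, float *y)
{
    assert(A && g && y);
    for (int i = 0; i < A->n; i++) {
        float tmp = 0.0f;
        for (int j = A->ptr[i]; j < A->ptr[i + 1]; j++) {
            tmp += A->value[j] * g[A->index[j]];
        }
        y[i] = tmp;
    }
}
```

For example, for the 3x3 matrix with rows (2 0 1), (0 3 0), (4 0 5), i.e. ptr = {0, 2, 3, 5}, index = {0, 2, 1, 0, 2}, value = {2, 1, 3, 4, 5}, and g = (1, 2, 3), this yields y = (5, 6, 19).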
Best,
Paul