When I used nested loops to implement operations on complex numbers, I ran into a problem:
cufftComplex* a = (cufftComplex*)malloc(Length_theta* M* sizeof(cufftComplex));
#pragma acc kernels
// if(i==3&&j<10)std::cout <<Theta*180.0/PI<<'\t'<< a[i*M+j].x<<'\t'<<a[i*M+j].y <<'\t'<<'\n';
The compiler feedback for this part is:
157, Loop carried dependence of a->x prevents parallelization
Loop carried dependence of a->x prevents vectorization
Loop carried backward dependence of a->x prevents vectorization
158, Loop is parallelizable
Generating NVIDIA GPU code
157, #pragma acc loop seq
158, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
What I want is for all Length_theta * M iterations to be computed in parallel, one per thread, because there is no data dependency between them. The compiler does not seem to see it that way, though.
With the “kernels” construct, the compiler must prove there are no dependencies before it can parallelize the loops. However, since you’re using computed indices (a[i*M+j]), the compiler can’t tell whether the accesses to “a” are independent across iterations of the outer loop, so it runs that loop sequentially.
To fix this, either use “kernels loop independent” or switch to the “parallel” construct, where “independent” is the default. “independent” asserts to the compiler that there are no dependencies.
#pragma acc kernels loop independent collapse(2)
#pragma acc parallel loop collapse(2)
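As a minimal sketch of where the second form would sit on a two-level loop nest: the original loop body is not shown in this thread, so the body below is purely illustrative. `Complex2f` and `fill_grid` are hypothetical names; `Complex2f` stands in for `cufftComplex` (same `x`/`y` float members) so the example builds without the CUDA toolkit, and the `copyout` clause is an assumption about the data movement you would want.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftComplex (float members x and y),
// used so this sketch compiles without cufft.h.
struct Complex2f {
    float x;
    float y;
};

// Fills a rows-by-cols grid stored row-major, like a[i*M+j] in the question.
// The per-element computation is illustrative only.
std::vector<Complex2f> fill_grid(int rows, int cols) {
    std::vector<Complex2f> grid(static_cast<std::size_t>(rows) * cols);
    Complex2f* a = grid.data();

    // "parallel loop" treats iterations as independent by default, and
    // collapse(2) merges both loops into one rows*cols iteration space.
    // A compiler without OpenACC support ignores the pragma and runs serially.
    #pragma acc parallel loop collapse(2) copyout(a[0:rows*cols])
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            const float phase = 0.01f * static_cast<float>(i * cols + j);
            a[i * cols + j].x = std::cos(phase);  // each (i, j) writes only its own element
            a[i * cols + j].y = std::sin(phase);
        }
    }
    return grid;
}
```

Note that “independent” is a promise from you, not something the compiler verifies: it is only safe here because every (i, j) pair writes a distinct element of the array.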
Hope this helps.