Loop carried dependence of a->x prevents parallelization

When I used nested loops to fill a complex-valued array, I ran into a problem:

    cufftComplex* a = (cufftComplex*)malloc(Length_theta * M * sizeof(cufftComplex));

    #pragma acc kernels
    for (int i = 0; i < Length_theta; i++)
        for (int j = 0; j < M; j++)
        {
            double Theta = (-90 + i) * deg2rad;
            a[i*M + j].x = cos(2*PI*f0*d*sin(Theta)/c * j);
            a[i*M + j].y = sin(2*PI*f0*d*sin(Theta)/c * j);
            // if (i == 3 && j < 10) std::cout << Theta*180.0/PI << '\t' << a[i*M+j].x << '\t' << a[i*M+j].y << '\t' << '\n';
        }

The compiler feedback on this part is:

    157, Loop carried dependence of a->x prevents parallelization
         Loop carried dependence of a->x prevents vectorization
         Loop carried backward dependence of a->x prevents vectorization
    158, Loop is parallelizable
         Generating NVIDIA GPU code
        157, #pragma acc loop seq
        158, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

What I want is for all Length_theta * M iterations to be computed in parallel, since there is no data dependency between them, but it seems the compiler does not see it that way.

With the “kernels” construct, the compiler must prove there are no dependencies in order to parallelize the loops. However, since you’re using computed indices, the compiler can’t tell if the accesses to “a” are independent across loop iterations.

To fix this, either use “kernels loop independent” or the “parallel” construct, where “independent” is the default. “independent” asserts to the compiler that there are no dependencies.

Example:

    #pragma acc kernels loop independent collapse(2)
    for (int i = 0; i < Length_theta; i++)
        for (int j = 0; j < M; j++)
        {

or

    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < Length_theta; i++)
        for (int j = 0; j < M; j++)
        {
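
One other thing to watch: since “a” is allocated with plain malloc, the array also has to be made available on the device, either by compiling with managed memory (e.g. -gpu=managed) or by adding a data clause. Below is a minimal self-contained sketch of the “parallel” version with an explicit copyout clause; the sizes and signal parameters are placeholder values, not your actual ones:

    #include <cmath>
    #include <cstdlib>
    #include <cufft.h>

    int main()
    {
        // Placeholder sizes and parameters (assumed values for illustration)
        const int    Length_theta = 181;
        const int    M            = 64;
        const double PI      = 3.14159265358979323846;
        const double deg2rad = PI / 180.0;
        const double f0 = 1.0e9;   // carrier frequency (placeholder)
        const double d  = 0.15;    // element spacing (placeholder)
        const double c  = 3.0e8;   // propagation speed (placeholder)

        cufftComplex* a = (cufftComplex*)malloc(Length_theta * M * sizeof(cufftComplex));

        // "parallel loop" implies independent; collapse(2) merges both loops
        // into a single Length_theta*M iteration space. copyout creates the
        // array on the device and copies the results back to the host at the
        // end of the region.
        #pragma acc parallel loop collapse(2) copyout(a[0:Length_theta*M])
        for (int i = 0; i < Length_theta; i++)
            for (int j = 0; j < M; j++)
            {
                double Theta = (-90 + i) * deg2rad;
                a[i*M + j].x = cos(2*PI*f0*d*sin(Theta)/c * j);
                a[i*M + j].y = sin(2*PI*f0*d*sin(Theta)/c * j);
            }

        free(a);
        return 0;
    }

Compiled with something like “nvc++ -acc -Minfo=accel”, the feedback should now report the loops as parallelizable with a combined gang/vector schedule instead of “#pragma acc loop seq”.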

Hope this helps,
Mat
