Am I translating this serial code to OpenACC right?

Hi! I would like to make sure that a certain part of my code is correctly translated to parallel OpenACC code.

Here is the original code:

    for(i=0; i<bands; i++)
    {
        mean=0;
        for(j=0; j<N; j++)
            mean+=(image[(i*N)+j]);

        mean/=N;
        meanSpect[i]=mean;

        for(j=0; j<N; j++)
            image[(i*N)+j]=image[(i*N)+j]-mean;
    }

This is my OpenACC parallel code:

    #pragma acc parallel loop
    for(i=0; i<bands; i++)
    {
        mean=0;

        #pragma acc loop reduction(+:mean)
        for(j=0; j<N; j++)
            mean+=(image[(i*N)+j]);

        mean/=N;
        meanSpect[i]=mean;

        #pragma acc loop
        for(j=0; j<N; j++)
            image[(i*N)+j]=image[(i*N)+j]-mean;
    }

As far as I understand, everything is parallelizable since there are no data dependencies other than the “mean” variable, which is the part I'm unsure about. I believe the outer for loop will be parallelized into a gang, with one vector (thread) per iteration (i value), right?

So each vector / thread will need its own mean variable, right? How do I tell them to do so? Is it OK now or do I have to do something else?

Thank you very much!

Gang will be applied to the outer “i” loop with vector applied to the inner “j” loops.

> So each vector / thread will need its own mean variable, right? How do I tell them to do so? Is it OK now or do I have to do something else?

No, “mean” is private to the gang but shared amongst the vectors within each gang. However, reductions are special: each vector gets its own copy of “mean”, which is used to compute a partial reduction. Additional code is then inserted after the loop to combine the partial reductions into a final value, which is stored back into the shared “mean” variable.

What’s the size of “bands” and “N”?

Inner reductions can cause overhead, so if “bands” is large and “N” is small (<64), you might try adding “gang vector” to the outer loop and dropping the “loop” directives on the inner loops. If “N” is larger, the code is probably fine as is. Though since it’s easy to experiment, you might try this anyway. Record the kernel time via profiling (either using Nsight Compute or setting the environment variable “NV_ACC_TIME=1”) for each version with a representative data set, and then decide which to use.