private OpenACC clause on loop, kernels, and parallel constructs

Hi,

After finding out that a private clause on a loop construct caused a performance penalty, I have a question about how PGI interprets the private clause.

According to the OpenACC standard v1.0, the private clause is allowed on the parallel and loop constructs, but not on the kernels construct. If a private clause is on a loop construct, the variables in the clause are supposed to be created for every iteration. Here is my question regarding “kernels” and “private” usage: if I want to explicitly declare a list of variables as private within a gang, but do not want them created for every iteration under a kernels construct, what is the correct way to use these constructs and the clause?

Thanks,

Youngsung

Hi Youngsung,

Do you mean that you want to create a variable that is private to a gang but shared amongst the vectors in a gang?

!$acc kernels
!$acc loop gang private(A)
do i=1, N
!$acc loop vector
  do j=1,M
 ...

Here A is private to each iteration of the “i” loop, but shared amongst the iterations of the “j” loop (i.e. the vectors).
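To make the fragment above concrete, here is a minimal, self-contained sketch. The array names (a, b, c), the sizes, and the arithmetic are hypothetical and chosen only for illustration; the directive placement is the same as in the fragment. Without an OpenACC compiler the directives are treated as comments and the program runs serially.

```fortran
program gang_private_sketch
  implicit none
  ! Hypothetical sizes and arrays, for illustration only.
  integer, parameter :: n = 64, m = 32
  real :: a(m)                 ! scratch array: one copy per gang (per "i" iteration)
  real :: b(m, n), c(m, n)
  integer :: i, j

  b = 1.0

  !$acc kernels copyin(b) copyout(c)
  !$acc loop gang private(a)
  do i = 1, n
    !$acc loop vector
    do j = 1, m
      a(j) = 2.0 * b(j, i)     ! each "j" iteration writes its own element of a
    end do
    !$acc loop vector
    do j = 1, m
      c(j, i) = a(j) + 1.0     ! the same gang-private copy of a is read back
    end do
  end do
  !$acc end kernels

  print *, c(1, 1), c(m, n)
end program gang_private_sketch
```

Because a is listed on the gang loop, different “i” iterations do not overwrite each other's scratch data, while both “j” loops inside one “i” iteration see the same copy.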

  • Mat

Hi Mat,

Thanks for your kind explanation. It is good to know that putting the private clause on the gang loop lets the vectors share the variables.

However, my situation is a bit more complicated. Please see my code below:

1 !$acc kernels
2 !$acc loop gang(ngangs) vector(neblk)
3 do ie=1,nelem
4 !$acc loop vector(npts) private(s1,s2,i,j,k,l)
5 do ii=1,npts
6 … computation using private variables and others

On line #4, I added a private clause, and it caused a performance penalty.
On line #2, I have gang as well as vector. When I moved the private clause from line #4 to line #2, I saw an approximately 10% performance improvement, but the computation result differed from the previous one.

In fact, when I deleted the private clause from the source code entirely, I got the same result as well as a 2X speed-up. So I am still confused about how PGI handles the private clause.

Thanks,

Youngsung

private(s1,s2,i,j,k,l)

These all look like scalars? By default scalars are made local to the generated kernel. This makes them private and has the added benefit that these variables are more likely to be put into a register.

When you add a scalar to a private clause, you are creating an array of these scalars in global memory, where each loop iteration (gang or vector) has its own element. Since the variable is now in global memory, your code slows down.

I’ve talked with our compiler engineers about this and they agree that we need to rework this implementation. Essentially we should ignore scalars in a private clause when they are placed on a vector-only loop and instead always make them local to the kernel. For a private on a gang loop, we should be using shared memory instead of global.

We’ll probably make this change once the proposed OpenACC 2.0 “default(none)” clause is implemented. Until then, the recommendation is not to put scalars in private clauses unless absolutely necessary.
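As a rough sketch of that recommendation, the loop nest below leaves the scalars out of any private clause. The arrays x and y, the sizes, and the arithmetic are hypothetical stand-ins for the real computation, and the gang and vector sizes from the original code are omitted. It assumes each scalar is assigned before it is read in every iteration and is not used after the region.

```fortran
program scalar_default_sketch
  implicit none
  ! Hypothetical sizes and arrays, for illustration only.
  integer, parameter :: nelem = 128, npts = 16
  real :: x(npts, nelem), y(npts, nelem)
  real :: s1, s2
  integer :: ie, ii

  x = 1.0

  !$acc kernels copyin(x) copyout(y)
  !$acc loop gang
  do ie = 1, nelem
    !$acc loop vector
    do ii = 1, npts
      ! s1 and s2 are assigned before use in every iteration and are not
      ! read after the region, so no private clause is listed for them.
      s1 = 2.0 * x(ii, ie)
      s2 = s1 + real(ii)
      y(ii, ie) = s1 * s2
    end do
  end do
  !$acc end kernels

  print *, y(1, 1), y(npts, nelem)
end program scalar_default_sketch
```

Checking the compiler feedback (for example with -Minfo=accel on the PGI compilers) is a reasonable way to confirm how the loops were actually scheduled.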

Hope this helps,
Mat

I’ve got a clear idea now of how it works! Thanks a lot, Mat.

If I try this with my code, I get complaints about live-out induction variables. Making the induction variable private to the loop stops the compiler from moaning, but apparently incurs a performance penalty. How can I inform the compiler that the variable “i” in a loop really doesn’t need to be remembered for the next loop over “i”?

You need to look at how the induction variables are being declared and used. The clear case is when they are used on the right-hand side after the compute region. Less clear cases are if they are declared global, used as arguments to a subroutine, or have some other static storage. Sometimes branches can also cause this.

The quick method is to set the variables to some value immediately after the loop where they are used (e.g. “i=1”), or to use different induction variables. But otherwise, this is a case where the “private” clause may be necessary.
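A minimal sketch of the quick method, with a hypothetical array v and size n: the induction variable is overwritten right after the compute region, so its final loop value is never read later. Whether the compiler reports a live-out variable in the first place depends on how “i” is declared and used in the surrounding code, so this only illustrates the pattern.

```fortran
program live_out_sketch
  implicit none
  ! Hypothetical size and array, for illustration only.
  integer, parameter :: n = 1000
  real :: v(n)
  integer :: i

  !$acc kernels copyout(v)
  do i = 1, n
    v(i) = real(i)
  end do
  !$acc end kernels

  ! Overwrite i right after the region so the value it held at loop exit
  ! is never needed by the code that follows.
  i = 1

  print *, v(i), v(n)
end program live_out_sketch
```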

  • Mat