Accelerating 3 nested loops

Hello,

I’ve written connected-components code that requires scanning the image 252 times to ensure that each detected shape gets its own unique label.

My code therefore contains 3 nested for loops, which I accelerated with OpenACC:

#pragma acc kernels loop
for (int z=0; z<252; z++)
{
#pragma acc loop
for (int i=0; i<width_image; i++)
{
#pragma acc loop
for (int j=0; j<height; j++)
{
" CODE "
}
}
}

So the achieved performance is great, with 94% achieved occupancy, but when testing on a Tesla K40c the achieved occupancy is only 84%.
I tried to increase the occupancy by adding directives in several configurations, but the occupancy only got lower.
Any suggestions?

Hi ibm218,

The occupancy is controlled by several factors such as the number of registers each thread uses, the amount of shared memory used per block, and the total number of warps used.

Given the occupancy went down when moving to a K40 (presumably from an older device?), my guess is that you don’t have enough warps to fully populate the device. I’d try collapsing the loops by removing the inner two “loop” directives and adding “collapse(3)” to the outer loop. Depending on the size of height, you can also try collapsing just the outer two loops and scheduling the “j” loop as “vector”.
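For example, something along these lines (just a sketch of the directive placement, using the loop bounds from your post; the iterations still need to be independent for this to be valid):

// Sketch 1: collapse all three loops into one iteration space
#pragma acc kernels loop collapse(3)
for (int z=0; z<252; z++)
{
for (int i=0; i<width_image; i++)
{
for (int j=0; j<height; j++)
{
" CODE "
}
}
}

// Sketch 2: collapse the outer two loops and vectorize "j"
#pragma acc kernels loop collapse(2)
for (int z=0; z<252; z++)
{
for (int i=0; i<width_image; i++)
{
#pragma acc loop vector
for (int j=0; j<height; j++)
{
" CODE "
}
}
}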

Some helpful information would be the output from “-Minfo=accel” to see what schedule the compiler assigned to the loops, the register and shared memory used via the “-ta=tesla:ptxinfo” flag, and the profiling output from setting the environment variable “PGI_ACC_TIME=1” to show the actual launch configuration.

Also, what are the values of “width_image” and “height”?

-Mat

My GPU is a GTX 1060, and yes, it has fewer cores than the Tesla K40.

When I compile, I use maxregcount=32.

“width_image” and “height” depend on the image, but it’s usually 581x429.

Here is my code, in case it’s needed. I removed the inner loop directives and used “collapse(3)”, but the code didn’t finish executing after more than a minute:

#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
#pragma acc kernels loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
if(label[i][j] >0 && label[i-1][j-1] >0)
{
label[i][j] = min (label[i][j],label[i-1][j-1]);
label[i-1][j-1] = min (label[i][j],label[i-1][j-1]);
}
if(label[i][j] >0 && label[i-1][j] >0)
{
label[i][j] = min (label[i][j],label[i-1][j]);
label[i-1][j] = min (label[i][j],label[i-1][j]);
}
if(label[i][j] >0 && label[i-1][j+1] >0)
{
label[i][j] = min (label[i][j],label[i-1][j+1]);
label[i-1][j+1] = min (label[i][j],label[i-1][j+1]);
}
if(label[i][j] >0 && label[i][j-1] >0)
{
label[i][j] = min (label[i][j],label[i][j-1]);
label[i][j-1] = min (label[i][j],label[i][j-1]);
}

if(label[i][j] >0 && label[i][j+1] >0)
{
label[i][j] = min (label[i][j],label[i][j+1]);
label[i][j+1] = min (label[i][j],label[i][j+1]);
}

if(label[i][j] >0 && label[i+1][j+1] >0)
{
label[i][j] = min (label[i][j],label[i+1][j+1]);
label[i+1][j+1] = min (label[i][j],label[i+1][j+1]);
}
if(label[i][j] >0 && label[i+1][j] >0)
{
label[i][j] = min (label[i][j],label[i+1][j]);
label[i+1][j] = min (label[i][j],label[i+1][j]);
}
if(label[i][j] >0 && label[i+1][j-1] >0)
{
label[i][j] = min (label[i][j],label[i+1][j-1]);
label[i+1][j-1] = min (label[i][j],label[i+1][j-1]);
}
}

}

}

Hi ibm218,

This code is different from what you posted earlier in that you’re only offloading the two inner loops, hence you want to collapse those two inner loops to increase occupancy.
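Structurally, that would look something like this (a sketch only; as noted below, the compiler still has to prove the iterations independent before it will actually parallelize them):

#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
// Sketch: collapse the tightly nested i/j loops into one parallel loop
#pragma acc kernels loop collapse(2)
for (int i=1; i<height; i++)
{
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}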

However, given that these loops have dependencies and that you’re using “kernels” and “loop” without the “independent” clause, I would be surprised if the compiler is actually parallelizing these loops. What’s the output from “-Minfo=accel”?

In the actual code are you using “parallel” or adding “independent” to force parallelization? If so, are you getting correct answers? Given the forward and backward dependencies, I would highly doubt it.

Can you post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example?

-Mat

Hi Mat,

I have modified the code to this:
#pragma acc data copy(label[:height][:width])
#pragma acc kernels loop num_workers(2) independent
for(int z=0; z<252; z++)

{

#pragma acc loop vector (1024)
for (int i=1; i<height; i++)
{
#pragma acc loop vector(1024)
for (int j=0; j < (width); j++)
{
“SAME CODE”
}
}
}

Now I’m getting 98% occupancy on the GTX 1060; I will try on the K40c next Monday.
The code is running a bit faster now.

Also, as a follow-up,
"
#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
#pragma acc kernels loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
"
runs slower than
"
#pragma acc data copy(label[:height][:width])

#pragma acc kernels loop
for(int z=0; z<252; z++)
{
#pragma acc loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
"
Both versions had problems in the output on my 15.75 MPx test image, but the new implementation is faster and perfectly accurate.

Thanks again, sir.
Best regards,
Ibrahim

Hi Ibrahim,

runs slower than

That doesn’t surprise me. In the first case, you’re only parallelizing the inner two loops, and that kernel is launched 252 times. In the second case, you parallelize all three loops and the kernel is launched only once.

Note that it’s technically illegal OpenACC to nest a vector loop inside another vector loop. PGI accepts it with “kernels” since this is based on the old PGI Accelerator Model, which allowed it, but since we’re deprecating that support and GNU will flag it as an error, I’d recommend either collapsing these loops or using “worker” for the middle loop. Note also that “1024x1024” won’t fit on an NVIDIA GPU, so I suspect the compiler is reducing these values. (See the output from -Minfo=accel for the schedule the compiler actually used.)
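For example, either of these would be a legal schedule (sketches only, keeping your “independent” assertion on the outer loop; check -Minfo=accel for the vector length the compiler actually picks):

// Option 1: collapse the two inner loops into a single vector loop
#pragma acc kernels loop gang independent
for(int z=0; z<252; z++)
{
#pragma acc loop collapse(2) vector
for (int i=1; i<height; i++)
{
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}

// Option 2: "worker" on the middle loop, "vector" on the innermost
#pragma acc kernels loop gang independent
for(int z=0; z<252; z++)
{
#pragma acc loop worker
for (int i=1; i<height; i++)
{
#pragma acc loop vector
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}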

but the new implementation is faster and perfectly accurate.

I’m still a bit surprised that you’re getting correct answers given the race conditions in your code, but I’m glad it’s working for you.

-Mat