Accelerating 3 nested loops

Hello,

I’ve written connected-components code that requires scanning the image 252 times to ensure that each detected shape gets its own unique label.

My code therefore contains 3 nested for loops, which I accelerated with OpenACC:

#pragma acc kernels loop
for (int z=0; z<252; z++)
{
#pragma acc loop
for (int i=0; i<width_image; i++)
{
#pragma acc loop
for (int j=0; j<height; j++)
{
" CODE "
}
}
}

So the achieved performance is great, with 94% achieved occupancy, but when testing on a Tesla K40c the achieved occupancy is only 84%.
I tried to increase the occupancy by adding directives in several configurations, but the occupancy only got lower.
Any suggestions?

Hi ibm218,

The occupancy is controlled by several factors such as the number of registers each thread uses, the amount of shared memory used per block, and the total number of warps used.

Given the occupancy went down when moving to a K40 (presumably from an older device?), my guess is that you don’t have enough warps to fully populate the device. I’d try collapsing the loops by removing the inner two “loop” directives and adding “collapse(3)” to the outer loop. Depending on the size of height, you can also try collapsing just the outer two loops and scheduling the “j” loop as “vector”.
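For example, something along these lines (just a sketch of the directive placement, using the loop bounds from your post; the iterations still need to be independent for this to be valid):

// Sketch 1: collapse all three loops into one iteration space
#pragma acc kernels loop collapse(3)
for (int z=0; z<252; z++)
{
for (int i=0; i<width_image; i++)
{
for (int j=0; j<height; j++)
{
" CODE "
}
}
}

// Sketch 2: collapse the outer two loops and vectorize "j"
#pragma acc kernels loop collapse(2)
for (int z=0; z<252; z++)
{
for (int i=0; i<width_image; i++)
{
#pragma acc loop vector
for (int j=0; j<height; j++)
{
" CODE "
}
}
}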

Some helpful information would be the output from “-Minfo=accel” to see what schedule the compiler assigned to the loops, the register and shared memory used via the “-ta=tesla:ptxinfo” flag, and the profiling output from setting the environment variable “PGI_ACC_TIME=1” to show the actual launch configuration.

Also, what are the values of “width_image” and “height”?

-Mat

My GPU is a GTX 1060, and yes, it has fewer cores than the Tesla K40.

When I compile, I use maxregcount=32.

“width_image” and “height” depend on the image, but it’s usually 581x429.

Here is my code, in case it’s needed. I removed the inner loop directives and used “collapse(3)”, but the code didn’t finish executing after more than a minute:

#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
#pragma acc kernels loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
if(label[i][j] >0 && label[i-1][j-1] >0)
{
label[i][j] = min (label[i][j],label[i-1][j-1]);
label[i-1][j-1] = min (label[i][j],label[i-1][j-1]);
}
if(label[i][j] >0 && label[i-1][j] >0)
{
label[i][j] = min (label[i][j],label[i-1][j]);
label[i-1][j] = min (label[i][j],label[i-1][j]);
}
if(label[i][j] >0 && label[i-1][j+1] >0)
{
label[i][j] = min (label[i][j],label[i-1][j+1]);
label[i-1][j+1] = min (label[i][j],label[i-1][j+1]);
}
if(label[i][j] >0 && label[i][j-1] >0)
{
label[i][j] = min (label[i][j],label[i][j-1]);
label[i][j-1] = min (label[i][j],label[i][j-1]);
}

if(label[i][j] >0 && label[i][j+1] >0)
{
label[i][j] = min (label[i][j],label[i][j+1]);
label[i][j+1] = min (label[i][j],label[i][j+1]);
}

if(label[i][j] >0 && label[i+1][j+1] >0)
{
label[i][j] = min (label[i][j],label[i+1][j+1]);
label[i+1][j+1] = min (label[i][j],label[i+1][j+1]);
}
if(label[i][j] >0 && label[i+1][j] >0)
{
label[i][j] = min (label[i][j],label[i+1][j]);
label[i+1][j] = min (label[i][j],label[i+1][j]);
}
if(label[i][j] >0 && label[i+1][j-1] >0)
{
label[i][j] = min (label[i][j],label[i+1][j-1]);
label[i+1][j-1] = min (label[i][j],label[i+1][j-1]);
}
}

}

}

Hi ibm218,

This code is different from what you posted earlier in that you’re only offloading the two inner loops, hence you want to collapse those two inner loops to increase occupancy.
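Structurally, that would look something like this (a sketch only; as noted below, the compiler still has to prove the iterations independent before it will actually parallelize them):

#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
// Sketch: collapse the tightly nested i/j loops into one parallel loop
#pragma acc kernels loop collapse(2)
for (int i=1; i<height; i++)
{
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}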

However, given that these loops have dependencies and that you’re using “kernels” and “loop” without the “independent” clause, I would be surprised if the compiler is actually parallelizing these loops. What’s the output from “-Minfo=accel”?

In the actual code are you using “parallel” or adding “independent” to force parallelization? If so, are you getting correct answers? Given the forward and backward dependencies, I would highly doubt it.

Can you post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example?

-Mat

Hi Mat,

I have modified the code to this:
#pragma acc data copy(label[:height][:width])
#pragma acc kernels loop num_workers(2) independent
for(int z=0; z<252; z++)

{

#pragma acc loop vector (1024)
for (int i=1; i<height; i++)
{
#pragma acc loop vector(1024)
for (int j=0; j < (width); j++)
{
“SAME CODE”
}
}
}

Now I’m getting 98% occupancy on the GTX 1060; I will try on the K40c next Monday.
The code is running a bit faster now.

Also, as a follow-up,
"
#pragma acc data copy(label[:height][:width])
for(int z=0; z<252; z++)
{
#pragma acc kernels loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
"
runs slower than
"
#pragma acc data copy(label[:height][:width])

#pragma acc kernels loop
for(int z=0; z<252; z++)
{
#pragma acc loop
for (int i=1; i<height; i++)
{
#pragma acc loop
for (int j=0; j < (width); j++)
{
"
Both versions had problems in the output on my 15.75 MPx test image, but the new implementation is faster and perfectly accurate.

Thanks again, sir.
Best regards,
Ibrahim

Hi Ibrahim,

runs slower than

That doesn’t surprise me. In the first case, you’re only parallelizing the inner two loops, and that kernel is launched 252 times. In the second case, you parallelize all three loops and the kernel is launched only once.

Note that it’s technically illegal OpenACC to nest a vector loop inside another vector loop. PGI accepts it with “kernels” since this is based on the old PGI Accelerator Model, which allowed it, but since we’re deprecating that support and GNU will flag it as an error, I’d recommend either collapsing these loops or using “worker” for the middle loop. Note also that “1024x1024” won’t fit on an NVIDIA GPU, so I suspect the compiler is reducing these values. (See the output from -Minfo=accel for the schedule the compiler actually used.)
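For example, either of these would be a legal schedule (sketches only, keeping your “independent” assertion on the outer loop; check -Minfo=accel for the vector length the compiler actually picks):

// Option 1: collapse the two inner loops into a single vector loop
#pragma acc kernels loop gang independent
for(int z=0; z<252; z++)
{
#pragma acc loop collapse(2) vector
for (int i=1; i<height; i++)
{
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}

// Option 2: "worker" on the middle loop, "vector" on the innermost
#pragma acc kernels loop gang independent
for(int z=0; z<252; z++)
{
#pragma acc loop worker
for (int i=1; i<height; i++)
{
#pragma acc loop vector
for (int j=0; j < (width); j++)
{
" SAME CODE "
}
}
}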

but the new implementation is faster and perfectly accurate.

I’m still a bit surprised that you’re getting correct answers given the race conditions in your code, but I’m glad it’s working for you.

-Mat