Hi,
I am using OpenACC to parallelize 4 nested loops on a NVIDIA k20 GPU using PGI 14.4.
The problem I had to face is given by the fact that thread blocks are limited to 3 dimensions, thus I identified 2 solutions, but unfortunately, both of them present some limitations:
Solution A:
#pragma acc kernels present(...)
#pragma acc loop independent collapse(2) gang vector(DIM_BLOCK_Z)
for(t=0; t<nt; t++) {
for(z=0; z<nz; z++) {
#pragma acc loop independent gang vector(DIM_BLOCK_Y)
for(y=0; y<ny; y++) {
#pragma acc loop independent gang vector(DIM_BLOCK_X)
for(x=0; x < nx; x++) {
This solution should reproduce the behavior of an existing and working version written in CUDA which “collapse” the outer z and t dimensions on the third threads’ dimension.
The solution works if DIM_BLOCK_Z=1, but is giving wrong results if DIM_BLOCK_Z!=1 (the “wrong” result is interestingly always the same for every DIM_BLOCK_Z!=1 which I have tried).
My question is: am I trying to do something that is not allowed? I mean: trying to use the collapse clause on a part of the nested loops instead of using it on all of them.
Solution B:
#pragma acc kernels present(...)
#pragma acc loop independent collapse(4) gang vector(DIM_BLOCK)
for(t=0; t<nt; t++) {
for(z=0; z<nz; z++) {
for(y=0; y<ny; y++) {
for(hx=0; hx < nxh; hx++) {
This second solution attempts to use the collapse clause in a more “ordinary” way; it works, but forces to use 1D blocks… trying to add a worker(DIM_BLOCK_Y) to obtain 2D blocks as suggested here: https://forums.developer.nvidia.com/t/how-dose-pgi-manage-collapse-clause/134079/1 lead to this compilation error:
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Load of NULL symbol
In conclusion, solution A seems to be limited to 2D blocks, while solution B seems to be limited to 1D blocks. Does someone knows the reasons of these limitations? Does someone knows about a solution C able to divide the 4 nested loops in 3D blocks?
Thanks in advance,
Enrico