From four nested loops to 3D blocks

Hi,
I am using OpenACC to parallelize 4 nested loops on an NVIDIA K20 GPU using PGI 14.4.
The problem is that thread blocks are limited to 3 dimensions, so I identified 2 solutions; unfortunately, both of them have limitations:

Solution A:

  #pragma acc kernels present(...)
  #pragma acc loop independent collapse(2) gang vector(DIM_BLOCK_Z)
  for(t=0; t<nt; t++) {
    for(z=0; z<nz; z++) {
      #pragma acc loop independent gang vector(DIM_BLOCK_Y)
      for(y=0; y<ny; y++) {
        #pragma acc loop independent gang vector(DIM_BLOCK_X)
        for(x=0; x<nx; x++) {
          /* loop body */
        }
      }
    }
  }

This solution should reproduce the behavior of an existing, working CUDA version, which “collapses” the outer t and z dimensions onto the third thread dimension.

The solution works if DIM_BLOCK_Z=1, but gives wrong results if DIM_BLOCK_Z!=1 (interestingly, the “wrong” result is always the same for every DIM_BLOCK_Z!=1 I have tried).

My question is: am I trying to do something that is not allowed, i.e. applying the collapse clause to only part of the loop nest instead of to all of the loops?
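If partial collapse is not allowed, the only workaround I can think of is fusing t and z by hand, which is effectively what the CUDA code does. Just as a sketch of the indexing (untested):

  #pragma acc kernels present(...)
  #pragma acc loop independent gang vector(DIM_BLOCK_Z)
  for(tz=0; tz<nt*nz; tz++) {
    #pragma acc loop independent gang vector(DIM_BLOCK_Y)
    for(y=0; y<ny; y++) {
      #pragma acc loop independent gang vector(DIM_BLOCK_X)
      for(x=0; x<nx; x++) {
        int t = tz / nz;   /* recover the original indices */
        int z = tz % nz;
        /* loop body */
      }
    }
  }

but I would prefer to keep the four loops explicit.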


Solution B:

  #pragma acc kernels present(...)
  #pragma acc loop independent collapse(4) gang vector(DIM_BLOCK)
  for(t=0; t<nt; t++) {
    for(z=0; z<nz; z++) {
      for(y=0; y<ny; y++) {
        for(hx=0; hx<nxh; hx++) {
          /* loop body */
        }
      }
    }
  }

This second solution uses the collapse clause in a more “ordinary” way; it works, but it forces 1D blocks. Trying to add a worker(DIM_BLOCK_Y) clause to obtain 2D blocks, as suggested here: https://forums.developer.nvidia.com/t/how-dose-pgi-manage-collapse-clause/134079/1, leads to a compilation error. The combination I tried was roughly this (my reconstruction of the clause placement; the loop nest itself is unchanged):
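  #pragma acc loop independent collapse(4) gang worker(DIM_BLOCK_Y) vector(DIM_BLOCK)

The compiler reports: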

PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Load of NULL symbol

In conclusion, solution A seems to be limited to 2D blocks, while solution B seems to be limited to 1D blocks. Does anyone know the reason for these limitations? And is there a solution C able to map the 4 nested loops onto 3D blocks?

Thanks in advance,

Enrico

Hi Enrico,

Solution A is technically illegal in OpenACC: schedules of the same type can’t be nested. Instead, the standards committee added the “tile” clause in the OpenACC 2.0 specification to handle these cases.

We’re still in the process of implementing “tile”, but we do allow nested schedule clauses as an extension, mostly because we had this in the PGI Accelerator model and think it’s useful. Other compilers don’t support it, though, so your code won’t be portable.

I’d suggest trying:

  #pragma acc kernels present(...)
  #pragma acc loop independent gang
  for(t=0; t<nt; t++) {
    #pragma acc loop independent vector(DIM_BLOCK_Z)
    for(z=0; z<nz; z++) {
      #pragma acc loop independent vector(DIM_BLOCK_Y)
      for(y=0; y<ny; y++) {
        #pragma acc loop independent vector(DIM_BLOCK_X)
        for(x=0; x<nx; x++) {
          /* loop body */
        }
      }
    }
  }
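
With this schedule the gang loop maps to the CUDA grid, and the three nested vector loops should map to threadIdx.z, threadIdx.y, and threadIdx.x, which gives you the 3D blocks. For reference, here's a minimal self-contained sketch along these lines (the array sizes, vector lengths, and loop body are placeholders, not your code) that should compile with "pgcc -acc -Minfo=accel":

  #include <stdio.h>

  #define NT 8
  #define NZ 16
  #define NY 16
  #define NX 32

  int main(void)
  {
      static float a[NT][NZ][NY][NX];
      int t, z, y, x;

      /* gang -> thread blocks; the three vector loops -> threadIdx.z/y/x */
      #pragma acc kernels copyout(a)
      #pragma acc loop independent gang
      for (t = 0; t < NT; t++) {
          #pragma acc loop independent vector(4)
          for (z = 0; z < NZ; z++) {
              #pragma acc loop independent vector(8)
              for (y = 0; y < NY; y++) {
                  #pragma acc loop independent vector(32)
                  for (x = 0; x < NX; x++) {
                      a[t][z][y][x] = (float)(t + z + y + x);
                  }
              }
          }
      }

      printf("a[1][2][3][4] = %g\n", (double)a[1][2][3][4]);
      return 0;
  }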

The equivalent “tile” version would be:

  #pragma acc kernels present(...)
  #pragma acc loop independent gang
  for(t=0; t<nt; t++) {
    #pragma acc loop independent vector tile(DIM_BLOCK_Z, DIM_BLOCK_Y, DIM_BLOCK_X)
    for(z=0; z<nz; z++) {
      for(y=0; y<ny; y++) {
        for(x=0; x<nx; x++) {
          /* loop body */
        }
      }
    }
  }
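
One thing to keep in mind with either version: the product DIM_BLOCK_Z * DIM_BLOCK_Y * DIM_BLOCK_X becomes the thread count per block, which is capped at 1024 on a K20.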

Note that the “Load of NULL symbol” issue is a generic compiler error. Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) so we can determine the problem?

Thanks,
Mat

Thank you very much; the proposed solution works fine and allows the use of 3D thread blocks.

I will still prepare an example reproducing the compilation error of solution B and send it to PGI Customer Service.


Thanks,

Enrico