vector_length() ignored by compiler?

Hi,

I wanted to assess the performance of a loop nest for different specifications of vector_length(). However, the compiler seems to ignore the hint given by the vector_length() clause.

#pragma acc kernels
{
...
#pragma acc loop independent collapse(3) vector_length(32)
for(int i=0;i<Nx;++i){
  for(int j=0;j<Ny;++j){
    for(int k=0;k<Nz;++k){
// some finite difference stencil
    }
  }
}
...
}

No matter what value I put in vector_length(), the compiler always reports:

   681, Loop is parallelizable
    682, Loop is parallelizable
    683, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        681, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
        682,   /* blockIdx.x threadIdx.x collapsed */
        683,   /* blockIdx.x threadIdx.x collapsed */

Is there anything I am doing wrong here? Is it only possible if I use a parallel construct?

Thanks,
LS

Hi LS,

Is it only possible if I use a parallel construct?

In the 2.0 standard, “vector_length” only applies to a “parallel” construct. In the 2.5 standard, it was added to “kernels” as well. Currently, we accept the “vector_length” syntax on a “kernels” construct for C and Fortran, but it has no effect.

Note that “vector_length” applies to a compute construct and sets the vector width for all vector loops within that compute region. The “vector(width)” clause applies to a single loop. So here, you would want to use “vector(32)” instead of “vector_length(32)”.
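For example, a minimal sketch of your loop nest using the loop-level clause (with the stencil body elided) would be:

#pragma acc kernels
{
#pragma acc loop independent collapse(3) vector(32)
for(int i=0;i<Nx;++i){
  for(int j=0;j<Ny;++j){
    for(int k=0;k<Nz;++k){
      // some finite difference stencil
    }
  }
}
}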

Also note that with “kernels”, the compiler may override your vector width if it decides to create multiple vector loops. Since you’re collapsing the loops, it should use your suggestion, but if not, you may need to move to the “parallel” construct, where you have more control.
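For reference, a sketch of the equivalent “parallel” version (again with the stencil body elided) might look like:

#pragma acc parallel vector_length(32)
{
#pragma acc loop gang vector collapse(3)
for(int i=0;i<Nx;++i){
  for(int j=0;j<Ny;++j){
    for(int k=0;k<Nz;++k){
      // some finite difference stencil
    }
  }
}
}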

Finally, I generally recommend not setting vector widths at all, since doing so can reduce performance portability.

Hope this helps,
Mat

Hi Mat,

Thanks for elaborating on this topic. I had also tried “vector(32)”, but it did not have any effect despite the collapse clause.
When referring to performance portability, would that be more of an issue when switching between major targets like nvidia, radeon, xeon_phi, etc., or also within a class of devices, e.g. K20 vs. K80? As I understand it, the more recent standard allows the use of device_type() to apply certain optimal settings for a specific major target.
Do you also discourage setting vector widths for certain device_types?

Thanks,
LS

When referring to performance portability, would that be more of an issue when switching between major targets like nvidia, radeon, xeon_phi, etc., or also within a class of devices, e.g. K20 vs. K80?

Right now it’s more of an issue between classes of devices, but it can show up within the same class as well, for example between a C1030 and a K20, or between a Knights Corner and a Knights Landing.

Do you also discourage setting vector widths for certain device_types?

No, that’s actually one of the best uses for device_type and why it was created. The issue is more a lack of support for device_type on our end. It hasn’t been a high-priority item since there have been so few target devices. Now, with our addition of a multicore CPU target, it will become a higher priority, and it’s something I’ve been advocating for.
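Once support is there, a sketch of what that might look like (the device-type names and widths here are illustrative, not a tested recipe):

#pragma acc parallel loop collapse(3) \
        device_type(nvidia) vector_length(32) \
        device_type(host) vector_length(8)
for(int i=0;i<Nx;++i){
  for(int j=0;j<Ny;++j){
    for(int k=0;k<Nz;++k){
      // some finite difference stencil
    }
  }
}

Clauses placed after a device_type clause apply only to that device type, so you can keep one annotated loop nest and still tune the vector width per target.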

Mat