Question about the tile directive in OpenMP and OpenACC

Hi all,

In OpenMP, and also in OpenACC there is the so-called tile directive.

In OpenMP it can be utilized as follows

#pragma omp parallel for
#pragma omp tile sizes(5, 16)
for (int i = 0; i < ROWS; ++i) {
    for (int j = 0; j < COLS; ++j) {
        /* loop body */
    }
}

and, as far as I know, similarly for a GPU kernel…

#pragma omp target teams distribute parallel for
#pragma omp tile sizes(5, 16)
for (int i = 0; i < ROWS; ++i) {
    for (int j = 0; j < COLS; ++j) {
        /* loop body */
    }
}

I have the following questions:

  1. What happens in an edge case, i.e. if the loop bounds (ROWS and COLS in this example) are 21 and 33 while the tile sizes are 5 and 16, or if ROWS and COLS are even smaller than the tile sizes? Does this result in undefined behavior, or do OpenMP and OpenACC handle the partial tiles automatically?

  2. What are useful values for the various CPU and GPU models and manufacturers? Unfortunately, OpenMP and OpenACC do not offer many options for querying the hardware. Will it cause crashes or slowdowns if I choose a value that the hardware does not support?

I would like to use the tile directive in my code if such crashes or slowdowns can be avoided… OpenMP has an if clause for pragmas. Perhaps one could use it to adapt the tile directive to specific hardware?

Hi schulz.benjamin,

What happens with an edge case?

Correctness-wise, it shouldn’t matter. If the loop trip counts are not a multiple of the tile sizes, this gets accounted for in the loop bounds of the tile and element loops. Here’s what the loop nest looks like after tile is applied:

for (int it = 0; it < ROWS; it += 5)                     // tile loop    (i)
    for (int jt = 0; jt < COLS; jt += 16)                // tile loop    (j)
        for (int i = it; i < min(it+5, ROWS); ++i)       // element loop (i)
            for (int j = jt; j < min(jt+16, COLS); ++j)  // element loop (j)

Performance-wise it may not be good, since the parallelism would only apply to the outer loop, or to the two outer loops if you added a “collapse(2)” clause. Also, when offloaded to the GPU, full warps of 32 threads would be used, but some threads will not do any work. For example, you may have 5 blocks of 16 threads, so you’re wasting 16 threads in each block and limiting the maximum number of blocks.

Personally, I don’t use “tile” for GPUs and I’ve not seen a code where it helps. I’m not saying it wouldn’t do well for you, but you’d be the exception. It’s better to use “collapse(2)” and let the compiler create a strip-mined loop for the threads.

-Mat

Hi Mat, thank you for your answer.

I have now played around with this directive.

I can somewhat confirm what you say. I see no improvement on the GPU.

On the CPU, the improvement is very small, but thanks…