I accelerated my similarity measurement kernel on the Tesla K40c, trying each of the following two directives in turn
#pragma acc kernels loop collapse (2) independent
#pragma acc kernels loop tile (16,16) independent
When I profiled the kernel's performance limiters, I found that collapse gives 70% compute utilization while tile gives 54% when processing the same image (memory utilization is 65% in both cases).
My understanding is that because tile breaks the loop iterations into smaller tiles, fewer threads are actually used to process the code, and the rest of the threads concentrate on hiding latency.
Is my theory correct?
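
For reference, here is a minimal sketch of how the two directives could be applied to a doubly nested pixel loop. The kernel body (a per-pixel squared difference), the function names `sqdiff_collapse` and `sqdiff_tile`, the array names, and the data clauses are all assumptions for illustration; this is not the actual similarity kernel.

```c
/* Hypothetical stand-in for the real kernel: per-pixel squared difference
 * of two h x w images stored row-major. Names and data clauses are assumed. */

/* Variant 1: collapse(2) fuses both loops into one iteration space of h*w. */
void sqdiff_collapse(const float *a, const float *b, float *out, int h, int w)
{
    #pragma acc kernels loop collapse(2) independent \
        copyin(a[0:h*w], b[0:h*w]) copyout(out[0:h*w])
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float d = a[y * w + x] - b[y * w + x];
            out[y * w + x] = d * d;
        }
    }
}

/* Variant 2: tile(16,16) splits the same iteration space into 16x16 tiles. */
void sqdiff_tile(const float *a, const float *b, float *out, int h, int w)
{
    #pragma acc kernels loop tile(16,16) independent \
        copyin(a[0:h*w], b[0:h*w]) copyout(out[0:h*w])
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float d = a[y * w + x] - b[y * w + x];
            out[y * w + x] = d * d;
        }
    }
}
```

Both variants should produce identical results; only the mapping of iterations to the GPU differs. Without an OpenACC compiler the pragmas are ignored and the loops simply run sequentially on the host.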