tile gives a much lower compute utilization than collapse

Hello,

I accelerated my similarity measurement kernel on the Tesla K40c using

#pragma acc kernels loop collapse(2) independent
#pragma acc kernels loop tile(16,16) independent

and when profiling the kernel performance limiters, I found that collapse gives 70% compute utilization while tile gives only 54% when processing the same image (memory utilization is 65% for both).

My understanding is that because tile breaks the work into smaller tiles, fewer threads are actually used to process the code, while the rest of the threads concentrate on hiding the latency.

Is my theory correct?

Hi ibm218,

Since you're not specifying the vector length with the collapse version, the compiler is most likely using 128 threads, versus the 256 you get with the tile example. Though the total number of warps running on an SM should be the same in both cases (assuming you have enough compute work and are fully utilizing the device), so the number of threads per gang (aka block) shouldn't matter in terms of occupancy.
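If you want to rule that out, you could pin the vector length explicitly so the collapse version also runs 256 threads per block. This is a rough sketch only; the function name, array names, and bounds below are placeholders for your actual kernel:

    /* Sketch: request a 256-wide vector on the collapsed loops so the
       thread count matches the tile(16,16) version. */
    void similarity_collapse(int n, int m, float *out, const float *in)
    {
        #pragma acc kernels loop collapse(2) independent vector(256)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                out[i * m + j] = in[i * m + j];   /* placeholder body */
    }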

I’d look at the register usage and possibly amount of shared memory used (shown via the flag -ta=tesla:ptxinfo).

Can you post the loop as well as the compiler feedback messages (-Minfo=accel)?
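For example, a compile line along these lines will show both the ptxinfo register report and the feedback messages (the source file name is just a placeholder):

    pgcc -acc -ta=tesla:ptxinfo -Minfo=accel -c similarity.c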

-Mat

These are the loops:

  
for (int i = 1; i < 3500; i++)
{
    for (int j = 0; j < 4500; j++)
    {
        /* ... similarity computation on element (i, j) ... */
    }
}

For the registers, I am using the maxregcount=32 flag, and the Tesla K40c has 960 double-precision compute cores (64 per SM × 15 SMs).
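Roughly, the target flags I am compiling with look like this (paraphrasing from memory, so the exact sub-option spelling and ordering may be slightly off):

    -ta=tesla:cuda8.0,maxregcount:32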

I will obtain the rest of the information tomorrow morning, as I have no access to the lab PC right now.

This is the compiler output.

I solved the utilization issue by removing the cuda8.0 flag, which has resulted in:

collapse = 80%
tile = 77%

But the memory utilization has fallen to 55% for collapse, while it remained at 65% for tile.

As for execution speed, collapse now runs 1 ms faster than tile (previously the difference was 3 ms).

Is there a reason why collapse would perform faster?

You should consider running both versions within PGPROF/NVPROF with analytics enabled. This should give you a more concrete answer as to the difference.

My best guess is that since you're using a 2-D array, collapse does better because the 128 vectors are all striding across the contiguous dimension (i.e. the row) and thus getting better cache utilization. With tile, you're only using 16 vectors across the rows, with the other 16 going across the columns.

One experiment to try is tile sizes of 4x32 or 8x32, so that all the vectors in a warp (32 threads) are acting on the same row. With 16, half of the warp is processing one row while the other half is on another row.
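As a sketch of what I mean (note that, at least by my reading of the OpenACC spec, the first tile size applies to the innermost loop, so the 32 should line up with your inner j loop, which I'm assuming strides across the contiguous dimension; names and bounds are placeholders):

    /* Sketch: 32 elements along the inner (contiguous) j dimension and 4
       along i, so each 32-thread warp stays within a single row. */
    void similarity_tile(int n, int m, float *out, const float *in)
    {
        #pragma acc kernels loop tile(32,4) independent
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                out[i * m + j] = in[i * m + j];   /* placeholder body */
    }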

Note that by limiting the regcount to 32, you are getting some spilling. However, it’s only spilling 12 bytes so you should be ok. Spilling only becomes a problem when the spill size gets too big and starts spilling to global memory.

-Mat

The 4x32 tile did what I needed.

However, I did not understand "PGPROF/NVPROF with analytics enabled". Do you mean I should use "--analysis-metrics", or is there another flag I should use?

Correct. "--analysis-metrics" enables metric analysis from the command-line profiler. Or select the "Analysis" tab and then the kernel to analyze when profiling interactively via the GUI version of the profiler.

One caveat is that in order to collect analysis data, kernels may need to be replayed multiple times, which can take a significant amount of time. To avoid this, you can select a single kernel to analyze via the "--kernels" option, or limit the collected metrics via the "--metrics" option. To see a list of available metrics, run "pgprof --query-metrics".
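For example (the executable name, kernel name, and metric below are just placeholders):

    pgprof --analysis-metrics -o similarity.prof ./similarity
    pgprof --kernels <kernel name> --metrics achieved_occupancy ./similarity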

-Mat

Thanks Mat, this helped me a lot.