#pragma unroll not behaving as expected

I currently have a fairly simple CUDA kernel running on a Titan V card with the old 10.2 toolkit. The kernel loads a couple of fp16 matrices into shared memory, multiplies them on the tensor cores, and stores the product to shared memory before doing some processing. The operation is in a “for” loop that repeats four times, grabbing different data for the second matrix each time. When I put a “#pragma unroll” statement before the loop, the execution time increases by a small amount. However, if I put a “#pragma unroll 1” statement before the loop, the execution time decreases by a small but significant amount. This behavior is consistent over multiple profiling runs. As I understand from the programming guide, “#pragma unroll 1” directs the compiler not to unroll the loop at all.
I am a bit confused by this behavior, since an unrolled loop has fewer conditional branch instructions to execute. I am following the pattern outlined by the programming guide:

  1. I load in matrix_a outside the loop
  2. Inside the loop:
    a. I use the fill_fragment call to zero out the accumulator
    b. I use load_matrix_sync to load matrix_b
    c. I call mma_sync
    d. I store the accumulator to shared memory where it is further processed
  3. Back outside the loop, I do some further processing on the accumulated results and store the resulting data to shared memory
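
For readers who want something concrete to experiment with, the steps above could be sketched roughly as follows. This is not the original kernel, only a minimal reconstruction under stated assumptions: a single warp, 16x16x16 tiles, a float accumulator, and a placeholder "processing" step that just sums the tile. The names `wmma_loop`, `TILE`, `NUM_B`, `sh_a`, `sh_b` and `sh_c` are made up for illustration, and the final result goes to global memory here only so the host can print it.

```cuda
// Build (Volta): nvcc -arch=sm_70 -o wmma_loop wmma_loop.cu
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int TILE  = 16;  // assumed tile edge (16x16x16 WMMA shape)
constexpr int NUM_B = 4;   // the loop in the question runs four times

// Launch with exactly one warp (32 threads); sizes are illustrative only.
__global__ void wmma_loop(const half *a, const half *b, float *out)
{
    // WMMA loads/stores need 256-bit aligned pointers.
    __shared__ __align__(32) half  sh_a[TILE * TILE];
    __shared__ __align__(32) half  sh_b[NUM_B * TILE * TILE];
    __shared__ __align__(32) float sh_c[TILE * TILE];

    // Stage the fp16 inputs in shared memory.
    for (int i = threadIdx.x; i < TILE * TILE; i += blockDim.x) sh_a[i] = a[i];
    for (int i = threadIdx.x; i < NUM_B * TILE * TILE; i += blockDim.x) sh_b[i] = b[i];
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_c;

    // Step 1: load matrix_a once, outside the loop.
    wmma::load_matrix_sync(frag_a, sh_a, TILE);

    float sum = 0.0f;  // placeholder for the per-iteration processing

    // Swap this for a plain "#pragma unroll" to compare the two builds.
#pragma unroll 1
    for (int k = 0; k < NUM_B; ++k) {
        wmma::fill_fragment(frag_c, 0.0f);                             // 2a
        wmma::load_matrix_sync(frag_b, sh_b + k * TILE * TILE, TILE);  // 2b
        wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);                // 2c
        wmma::store_matrix_sync(sh_c, frag_c, TILE, wmma::mem_row_major);  // 2d
        __syncwarp();  // make the stored tile visible to every lane
        for (int i = threadIdx.x; i < TILE * TILE; i += 32) sum += sh_c[i];
        __syncwarp();  // finish reading before the next store overwrites sh_c
    }

    // Step 3: stand-in for the final processing.
    out[threadIdx.x] = sum;
}

int main()
{
    const int a_elems = TILE * TILE;
    const int b_elems = NUM_B * TILE * TILE;
    half h_a[a_elems], h_b[b_elems];
    for (int i = 0; i < a_elems; ++i) h_a[i] = __float2half(1.0f);
    for (int i = 0; i < b_elems; ++i) h_b[i] = __float2half(1.0f);

    half *d_a, *d_b;
    float *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    wmma_loop<<<1, 32>>>(d_a, d_b, d_out);

    float h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %f (%s)\n", h_out[0], cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return 0;
}
```

Building this once with each pragma gives two binaries that differ only in the unrolling directive, which is the comparison I am asking about.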

If anyone can tell me why disabling loop unrolling for this case actually helps the execution time, I would appreciate it greatly.

The number of conditional branch instructions is not the sole determinant of code performance. To understand what is going on:

(1) Look at the statistics output with nvcc command-line argument -Xptxas -v
(2) Look at the generated code with cuobjdump --dump-sass
(3) Collect run-time statistics from the CUDA profiler (make sure to include stats from the memory subsystem)
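
As a rough illustration of how steps (1) and (2) can be applied to the two configurations, here is a minimal sketch that selects the pragma with a preprocessor macro, so both variants come from one source file. The file name `probe.cu`, the macro `NO_UNROLL`, and the toy kernel are all assumptions for the example, not the poster's code; the command lines in the header comment use the flags named above.

```cuda
// probe.cu (hypothetical file name)
// Build both configurations and print per-kernel register/spill statistics:
//   nvcc -arch=sm_70 -Xptxas -v -o probe_unrolled probe.cu
//   nvcc -arch=sm_70 -Xptxas -v -DNO_UNROLL -o probe_rolled probe.cu
// Dump the machine code of each build for a side-by-side comparison:
//   cuobjdump --dump-sass probe_unrolled
//   cuobjdump --dump-sass probe_rolled
#include <cstdio>

constexpr int THREADS = 256;  // illustrative block size

__global__ void probe(const float *in, float *out, int stride)
{
    float acc = 0.0f;
    // The macro picks which pragma the compiler sees in front of the loop.
#ifdef NO_UNROLL
#pragma unroll 1
#else
#pragma unroll
#endif
    for (int k = 0; k < 4; ++k) {
        acc += in[k * stride + threadIdx.x] * (k + 1);
    }
    out[threadIdx.x] = acc;
}

int main()
{
    float *d_in, *d_out;
    cudaMalloc(&d_in, 4 * THREADS * sizeof(float));
    cudaMalloc(&d_out, THREADS * sizeof(float));
    cudaMemset(d_in, 0, 4 * THREADS * sizeof(float));

    // Time a single launch with CUDA events; a real measurement would
    // warm up first and average over many launches.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    probe<<<1, THREADS>>>(d_in, d_out, THREADS);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms (%s)\n", ms, cudaGetErrorString(cudaGetLastError()));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Differences in register count, spill traffic, or instruction mix between the two builds are the kind of thing worth correlating with the profiler data from step (3).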

There will likely be some salient differences between the two configurations of the code in the data from items (1) through (3). By staring at them long enough and discovering interesting correlations, it should be possible to develop a mental model of what is happening with this code. Some additional experiments should then make it possible to gain more confidence in the validity of that mental model.

Trying to speculate based on a rough textual description of the code alone is not going to be very fruitful: I flunked clairvoyance class back in school, and I lost my magic eight-ball.