I currently have a fairly simple CUDA kernel running on the old 10.2 toolkit on a Titan V card. The kernel loads a pair of fp16 matrices into shared memory, uses the tensor cores to multiply them, and stores the result back to shared memory before doing some further processing. The operation sits in a "for" loop that repeats four times, grabbing different data for the second matrix each time. When I put a "#pragma unroll" statement before the loop, the execution time increases by a small amount. However, if I put a "#pragma unroll 1" statement before the loop, the execution time decreases by a small but significant amount. This behavior is consistent over multiple profiling runs. As I understand from the programming guide, this form of the pragma directs the compiler not to unroll the loop at all.
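For clarity, the two variants I'm comparing look roughly like this (the loop body is elided; only the pragma placement matters here):

```cuda
#pragma unroll      // hint: compiler may fully unroll the 4-iteration loop
for (int i = 0; i < 4; ++i) { /* ... tensor-core work ... */ }

#pragma unroll 1    // explicitly disables unrolling; the loop stays rolled
for (int i = 0; i < 4; ++i) { /* ... tensor-core work ... */ }
```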
I am a bit confused by this behavior, since unrolling the loop should leave fewer conditional branch instructions to execute. I am following the pattern outlined in the programming guide:
- I load matrix_a outside the loop
- Inside the loop:
  a. I use the fill_fragment call to zero out the accumulator
  b. I use load_matrix_sync to load matrix_b
  c. I call mma_sync
  d. I store the accumulator to shared memory, where it is further processed
- Back outside the loop, I do some further processing on the accumulated results and store the resulting data to shared memory
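The steps above can be sketched as follows. This is a minimal reconstruction, not my actual kernel: the tile sizes, fragment layouts, shared-memory names, and the accumulator type are all assumptions, and the loading of the shared-memory tiles themselves is omitted.

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void tc_kernel() {
    // Hypothetical shared-memory tiles (names and sizes are assumptions).
    __shared__ half  smem_a[16 * 16];
    __shared__ half  smem_b[4][16 * 16];   // four matrix_b tiles, one per iteration
    __shared__ float smem_c[16 * 16];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag_c;

    // matrix_a is loaded once, outside the loop.
    wmma::load_matrix_sync(frag_a, smem_a, 16);

#pragma unroll 1   // keeping the loop rolled; this was the faster variant in my runs
    for (int i = 0; i < 4; ++i) {
        wmma::fill_fragment(frag_c, 0.0f);               // (a) zero the accumulator
        wmma::load_matrix_sync(frag_b, smem_b[i], 16);   // (b) different matrix_b each pass
        wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);  // (c) tensor-core multiply
        wmma::store_matrix_sync(smem_c, frag_c, 16,      // (d) store for further processing
                                wmma::mem_row_major);
        __syncthreads();
        // ... per-iteration processing of smem_c ...
    }
    // ... processing of the accumulated results, stored back to shared memory ...
}
```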
If anyone can tell me why disabling loop unrolling for this case actually helps the execution time, I would appreciate it greatly.