Quasi-Sequential Matrix Multiplication Using a Single Block


Can anyone please explain the surprisingly small difference between the execution times of the following two experiments?

  1. Multiplying two 16x16 matrices using a single block: execution time = 0.032 ms

  2. Multiplying two 256x256 matrices using a single block: execution time = 0.968 ms
    The resulting 256x256 matrix was computed by launching the kernel 256 times, with each launch computing one 16x16 block of the result.

Shouldn’t the execution time for experiment 2 be roughly 256 times that of experiment 1, i.e. about 8.2 ms? Why is it so much lower?
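
To make the gap concrete, here is the arithmetic behind my expectation as a small Python sketch. The variable names are my own, the only inputs are the two timings quoted above, and the estimate assumes (perhaps wrongly) that every kernel launch costs as much as the single launch in experiment 1.

```python
# Timings quoted in the question, in milliseconds.
t_single_ms = 0.032   # experiment 1: one 16x16 product, one kernel launch
t_tiled_ms = 0.968    # experiment 2: 256 launches, one per 16x16 output block

TILE = 16             # block edge length
N = 256               # matrix edge length in experiment 2
launches = (N // TILE) ** 2   # number of 16x16 blocks in a 256x256 result

# Naive expectation: each launch costs the same as experiment 1.
expected_ms = launches * t_single_ms

# How far the measurement falls below that expectation.
shortfall = expected_ms / t_tiled_ms
avg_per_launch_us = t_tiled_ms / launches * 1000.0

print(f"launches            : {launches}")
print(f"expected (naive)    : {expected_ms:.3f} ms")
print(f"measured            : {t_tiled_ms:.3f} ms")
print(f"expected / measured : {shortfall:.2f}x")
print(f"avg per launch      : {avg_per_launch_us:.2f} us")
```

By this estimate the measured time is roughly 8.5 times lower than expected, which works out to an average of under 4 microseconds per launch in experiment 2.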