Can anyone please explain the surprisingly small difference in the execution times of the following two experiments?
Experiment 1: execution time for two 16x16 matrices, using a single block = 0.032 ms
Experiment 2: execution time for two 256x256 matrices, using a single block = 0.968 ms
The resultant 256x256 matrix was calculated by running the kernel 256 times, with each launch calculating one 16x16 block of the resultant matrix.
Shouldn’t the execution time for experiment 2 be roughly 256 times that of experiment 1? Why is it so much lower?
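
To put numbers on that expectation, here is the arithmetic, assuming (and this is only my assumption) that each of the 256 kernel launches costs about the same as the single launch in experiment 1:

\[
\begin{aligned}
t_{\text{expected}} &\approx 256 \times 0.032\ \text{ms} = 8.192\ \text{ms} \\
t_{\text{measured}} &= 0.968\ \text{ms} \\
\frac{t_{\text{measured}}}{t_{\text{experiment 1}}} &= \frac{0.968}{0.032} = 30.25
\end{aligned}
\]

So the measured time is only about 30 times the time of experiment 1 instead of 256 times, roughly 8.5 times lower than I expected.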