Hi, this is a continuation of the matrix-multiplication problem described in my other thread.

In Ch3 of the NVIDIA Programming Guide, there are two sample matrix-multiplication codes: one that uses shared memory and one that doesn’t. In the shared-memory version, the elements of A are read from global memory far fewer times than in the naive version. However, does this even matter, given that the threads are running in parallel? For example, I understand that in the naive matrix multiplication, each element A(i,j) is read B_width times, since A(i,j) is multiplied by B(j,k) for k = 1, 2, 3, …, B_width. Is there a latency penalty for this read if all of those threads are reading the same element (A(i,j) in this case) concurrently?
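
To pin down the read counts I mean, here is a rough host-side sketch in plain C++. It is not the Guide's code: the function names and the `tile` parameter are my own, and I'm assuming the matrix dimensions are multiples of the tile size. It only tallies how many loads of A each scheme issues from global memory, so it says nothing about latency, which is the part I'm asking about.

```cpp
#include <cassert>
#include <cstddef>

// Naive scheme: the thread computing C(i,k) walks row i of A,
// issuing one global load of A per step of the inner loop.
// (Hypothetical helper; counts loads only, does not model timing.)
std::size_t naive_loads_of_A(std::size_t a_height,
                             std::size_t a_width,
                             std::size_t b_width) {
    std::size_t loads = 0;
    for (std::size_t i = 0; i < a_height; ++i)
        for (std::size_t k = 0; k < b_width; ++k)
            for (std::size_t j = 0; j < a_width; ++j)
                ++loads;  // one read of A(i,j) for output C(i,k)
    return loads;
}

// Tiled scheme: each tile x tile block of C steps along the shared
// dimension one tile at a time, and at each step the block copies a
// tile x tile piece of A from global memory into shared memory once.
// (Assumption: all dimensions are multiples of `tile`.)
std::size_t tiled_loads_of_A(std::size_t a_height,
                             std::size_t a_width,
                             std::size_t b_width,
                             std::size_t tile) {
    assert(tile > 0);
    assert(a_height % tile == 0 && a_width % tile == 0 && b_width % tile == 0);
    std::size_t loads = 0;
    for (std::size_t block_row = 0; block_row < a_height / tile; ++block_row)
        for (std::size_t block_col = 0; block_col < b_width / tile; ++block_col)
            for (std::size_t step = 0; step < a_width / tile; ++step)
                loads += tile * tile;  // one global load per element of the A tile
    return loads;
}
```

For 64 x 64 matrices, `naive_loads_of_A(64, 64, 64)` counts 64 reads of every element of A, while `tiled_loads_of_A(64, 64, 64, 16)` counts 4 reads of every element, i.e. B_width / tile. So the tiled version cuts the global reads of A by a factor of the tile size; what I can't tell is how much that matters when the naive reads happen concurrently.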