Number of Reads from the global memory and latency

Hi, this is a continuation of the matrix-multiplication problem described in my other thread.

In Ch. 3 of the NVIDIA Programming Guide, there are two sample matrix-multiplication codes: one that uses shared memory and one that doesn't. In the former, the elements of A are read from global memory far fewer times than in the latter. However, does this even matter, given that the threads are running in parallel? For example, I understand that in the naive matrix multiplication, each element A(i,j) is read B_width times, since A(i,j) is multiplied by B(j,k) for k = 1, 2, 3, …, B_width. Is there a latency penalty on these reads if all the threads are reading the same element (A(i,j) in this case) concurrently?
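
For reference, here is roughly what I mean by the naive version (a minimal sketch with my own names, not the guide's exact code; dimensions assumed to be multiples of the block size):

```cuda
// Naive multiply, C = A * B, one thread per element of C.
// A is hA x wA, B is wA x wB, all row-major.
__global__ void matMulNaive(const float* A, const float* B, float* C,
                            int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        // A[row * wA + k] is fetched from global memory by every
        // thread computing row `row` of C, i.e. wB times in total.
        sum += A[row * wA + k] * B[k * wB + col];

    C[row * wB + col] = sum;
}
```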

Using shared RAM makes a big difference. Why? The main reason is poor locality in the naive version (no coalescing): reads along the rows of A will coalesce, while reads down the columns of B will take 16x more memory transactions (one per thread in a half-warp).
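
Concretely, for a half-warp of 16 threads the two patterns look like this (a sketch assuming row-major storage and pre-Fermi coalescing rules; all names are mine):

```cuda
// Illustration: how a half-warp of 16 threads hits global memory
// when reading along a row vs. down a column of a row-major matrix
// with row length `width`. Illustrative names throughout.
__global__ void accessPatterns(const float* A, const float* B,
                               float* out, int width, int row, int col)
{
    int tid = threadIdx.x;           // 0..15 within a half-warp

    float a = A[row * width + tid];  // 16 consecutive addresses:
                                     // coalesces into one transaction
    float b = B[tid * width + col];  // stride of `width` floats:
                                     // 16 separate transactions
    out[tid] = a + b;                // keep the loads live
}
```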

Shared RAM also has higher bandwidth than global RAM. On a Tesla C1060, it's

0.5 ops/(bank · cycle) × 16 banks × 30 MPs × 1.3 GHz × 4 bytes ≈ 1160 GiB/s

vs 95 GiB/s for global RAM.
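
Working that through: 0.5 × 16 × 30 × 1.3e9 ≈ 3.1e11 shared-memory accesses per second; at 4 bytes each, that's ≈ 1.25e12 bytes/s ≈ 1160 GiB/s.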

(As an aside, can someone tell me why shared RAM is slower per bank than global RAM?)

The inner loop of matrix multiplication is clearly bandwidth-limited, so every bit of bandwidth helps.
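
For reference, the loop in question is the k-loop of a tiled kernel along these lines (a sketch in the spirit of the guide's shared-memory example, with my own names; dimensions assumed to be multiples of TILE):

```cuda
#define TILE 16  // one half-warp wide; my choice for illustration

// Tiled multiply, C = A * B, row-major, dimensions assumed to be
// multiples of TILE. Not the guide's exact code.
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int wA, int wB)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < wA / TILE; ++t) {
        // One coalesced global read per matrix per tile...
        As[threadIdx.y][threadIdx.x] = A[row * wA + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * wB + col];
        __syncthreads();

        // ...then TILE re-reads from shared RAM. This k-loop is the
        // bandwidth-limited part: two shared loads and one FMA per trip.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * wB + col] = sum;
}
```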

If the bandwidth is enough, there's still the question of latency. Assuming a 500-cycle latency, and given that a multiprocessor issues one warp instruction every 4 cycles (32 threads on 8 SPs), you need 500/4 = 125 warp-instructions of work to hide it.

Assuming the hardware supports 32 concurrent warps, the cycle at which each warp issues each instruction looks like this:

instr.    w0    w1    w2    w3   ...   w31
load       0     4     8    12   ...   124
?        128   132   136   140   ...   252
?        256   260   264   268   ...   380
?        384   388   392   396   ...   508
use      512

There need to be ~3 independent instructions between a load and the instruction that depends on it in order to hide the latency. This might be feasible. Maybe you can check the PTX.
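
In source terms, that schedule corresponds to something like the following (a sketch with illustrative names; the compiler is free to reorder, which is exactly why inspecting the PTX is worthwhile):

```cuda
// Sketch: ~3 independent instructions between a global load and its
// first use. With 32 resident warps issuing round-robin, the "use"
// then issues ~512 cycles after the load, covering a 500-cycle
// latency. All names here are illustrative.
__global__ void hideLatency(const float* g, float* out,
                            float x, float y, float z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float b  = g[i];     // global load issues here
    float t0 = x * y;    // independent instruction 1
    float t1 = t0 + z;   // independent instruction 2
    float t2 = t1 - x;   // independent instruction 3
    out[i]   = b * t2;   // first dependent use of b
}
```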