I was reading the blog post *CUTLASS: Fast Linear Algebra in CUDA C++*, and got confused by the following paragraph in the subsection Warp Tile.
> **Warp Tile**
>
> Once data is stored in shared memory, each warp computes a sequence of accumulated matrix products by iterating over the K dimension of the thread block tile, loading submatrices (or fragments) from shared memory, and computing an accumulated outer product. Figure 4 shows a detailed view. The sizes of the fragments are typically very small in the K dimension to maximize the compute intensity relative to the amount of data loaded from shared memory, thereby avoiding shared memory bandwidth as a bottleneck.
Specifically,
- What does compute intensity mean? Googling brought me to computational complexity, which is obviously not the same concept.
- How does a small K maximize compute intensity? On a CPU, accessing contiguous memory is fast thanks to the cache, so we would want a large K. Is the opposite somehow true on a GPU?
- If a small K is good for performance, why not simply let K = 1? There must be some tradeoff.
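For concreteness, here is the back-of-the-envelope arithmetic I tried (the function name, tile sizes, and 4-byte elements are my own assumptions, not from the post), modeling one warp-tile step as loading an M×K fragment of A and a K×N fragment of B from shared memory and accumulating an M×N outer product:

```python
def warp_tile_step(M, N, K, elem_bytes=4):
    """Naive FLOPs-per-byte model of one warp-tile step (my assumption,
    not taken from the CUTLASS source)."""
    flops = 2 * M * N * K                         # one multiply + one add per output element per k
    bytes_loaded = (M * K + K * N) * elem_bytes   # A and B fragments read from shared memory
    return flops / bytes_loaded                   # FLOPs per byte loaded

# Ratio for a 64x64 warp tile at several fragment depths K
for K in (1, 2, 4, 8):
    print(K, warp_tile_step(64, 64, K))
```

In this naive model the ratio comes out to 2·M·N·K / ((M+N)·K·elem_bytes), so the K terms cancel and the intensity looks independent of K, which only adds to my confusion about why a *small* K would be preferred.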
Any help is appreciated!