Conceptual questions about maximizing compute intensity

I was reading the blog post CUTLASS: Fast Linear Algebra in CUDA C++, and got confused by the following paragraph in the subsection Warp Tile.

Warp Tile

Once data is stored in shared memory, each warp computes a sequence of accumulated matrix products by iterating over the K dimension of the thread block tile, loading submatrices (or fragments) from shared memory, and computing an accumulated outer product. Figure 4 shows a detailed view. The sizes of the fragments are typically very small in the K dimension to maximize the compute intensity relative to the amount of data loaded from shared memory, thereby avoiding shared memory bandwidth as a bottleneck.

Specifically,

  1. What does compute intensity mean? Google brought me to Computational complexity, which is obviously not the same concept.
  2. How does a small K maximize compute intensity? On a CPU, accessing contiguous memory is fast thanks to the cache so we want to have a large K. Somehow the opposite is true on a GPU?
  3. If a small K is good for performance, why not simply let K = 1? There must be some tradeoff.

Any help is appreciated!

Compute intensity is the ratio of computational operations to data accessed, commonly expressed as floating-point operations per byte accessed (in short: FLOP/byte). The concept dates back to at least the early 1990s, when it was applied to supercomputers, which were mostly based on vector processing (Cray etc.).

Given that modern high-performance computing is often limited by available memory bandwidth (while floating-point operations are “too cheap to meter”), increasing the compute intensity of a computation is often a useful performance optimization strategy.

Thanks! Could you elaborate on how changing K affects compute intensity? When K doubles, we access 2x more data but also do 2x more calculations since we are looping over the K dimension, so I think the compute intensity should stay constant. In that case, why does a small K improve performance?
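The arithmetic behind this question can be sketched as follows. This is a simplified model of my own, not something stated in the blog post: it assumes a warp loads an M x K fragment of A and a K x N fragment of B from shared memory, performs 2 * M * N * K FLOPs on them (counting each multiply-add as 2), and it measures intensity in FLOPs per element loaded. The function name and the tile sizes used below are hypothetical.

```cpp
// Simplified model: FLOPs per element loaded from shared memory
// for one fragment pair in an accumulated outer product.
//   A fragment: m x k elements, B fragment: k x n elements.
//   Work:       2 * m * n * k FLOPs (multiply + add per product term).
double fragment_intensity(int m, int n, int k) {
    double flops  = 2.0 * m * n * k;
    double loaded = static_cast<double>(m + n) * k;  // elements of A plus B
    return flops / loaded;  // simplifies to 2*m*n / (m + n); k cancels
}
```

In this model `fragment_intensity(8, 8, 1)` and `fragment_intensity(8, 8, 4)` give the same value, which matches the observation above that K cancels out; the ratio grows only when M and N grow. So any benefit of a small K would have to come from something this model leaves out.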