Wave Quantization WMMA

In the Wave Quantization section of the Matrix Multiplication performance guide (Matrix Multiplication Background User's Guide - NVIDIA Docs), there is the following statement:

An NVIDIA A100 GPU has 108 SMs; in the particular case of 256x128 thread block tiles, it can execute one thread block per SM, leading to a wave size of 108 tiles that can execute simultaneously. Thus, GPU utilization will be highest when the number of tiles is an integer multiple of 108 or just below.

As far as I remember, and from what I could gather in the A100 spec (https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf), the max number of threads per SM is 2048 and the max number of threads per block is 1024, but a 256x128 thread block would contain 32768 threads.

Given the 2048 threads/SM limit, a single SM can accommodate at most 2 thread blocks, each containing 1024 threads (e.g. a 64x32 thread block). Therefore, I do not see how a 256x128 thread tile can fit onto a single SM without ignoring the 2048 threads/SM limit. The example in the text seems to multiply the 32 blocks/SM limit by 1024 threads/block to get 32768 threads/SM, which makes it look like a 256x128 thread block/tile can fit onto a single SM. With the 2048 threads/SM limit, is my understanding correct that the 256x128 thread block tile would actually be spread across 16 SMs rather than just 1?
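In case it helps, this is the arithmetic I'm doing (my own numbers, based on the limits in the whitepaper):

```cpp
// Arithmetic behind my question, using the limits from the A100 whitepaper.
#include <cstdio>

int main() {
    const int maxThreadsPerBlock = 1024;   // CUDA limit
    const int maxThreadsPerSM    = 2048;   // A100 limit
    const int tileThreads = 256 * 128;     // if one thread serviced each tile element

    printf("threads: %d -> %d full blocks, %d SMs' worth of threads\n",
           tileThreads,
           tileThreads / maxThreadsPerBlock,   // 32 blocks of 1024 threads
           tileThreads / maxThreadsPerSM);     // 16 SMs at 2048 threads each
    return 0;
}
```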

The threadblock tile refers to the chunk of data the threadblock will operate on. There is not necessarily a 1:1 correspondence between the number of elements in the tile and the number of threads in the block that is “servicing” that tile.

This idea seems fairly evident throughout section 3 of that document - not just the portion you excerpted.

Towards the beginning of section 3, you will see a description of the matrix being divided into tiles - this has data dimensions and elements in view, not threadblock dimensions.

There is a 1:1 correspondence between tiles and threadblocks, so if subdividing the matrix by tile dimensions yields a number of threadblocks that is not a whole-number multiple (or just under a whole-number multiple) of the number of SMs in the GPU, then you will potentially have a noticeable (i.e. measurable/visible in the profiler) under-utilization of the GPU due to “wave quantization”.
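As a rough illustration (my own arithmetic, not from the document; the matrix size is just an assumption), here is what the tile/wave counting looks like for a hypothetical GEMM on an A100:

```cpp
// Hypothetical wave-quantization arithmetic: tiles -> threadblocks -> waves of 108.
#include <cstdio>

int main() {
    const int M = 4096, N = 4096;        // assumed output matrix dimensions
    const int tileM = 256, tileN = 128;  // threadblock tile dimensions (data, not threads)
    const int numSMs = 108;              // A100 SM count

    // Each tile maps to one threadblock, so the grid has this many blocks:
    int tiles = ((M + tileM - 1) / tileM) * ((N + tileN - 1) / tileN);

    // With one block resident per SM, blocks execute in "waves" of 108.
    int fullWaves  = tiles / numSMs;
    int tailBlocks = tiles % numSMs;     // a partial last wave under-utilizes the GPU

    printf("tiles (blocks): %d, full waves: %d, tail blocks: %d\n",
           tiles, fullWaves, tailBlocks);
    return 0;
}
```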

On the flip side, as you point out, there clearly could not be a sensible interpretation where the size of a threadblock, or the number of resident threads on an SM, is given by 256x128.

To make sure that I am understanding correctly, I think you are saying that # tiles = # thread blocks, but # elements/tile != # threads/block (at least not necessarily 1:1). That means each thread in a thread block can process a sub-matrix of the tile assigned to that block. If that’s correct, then I think the statement that

in the particular case of 256x128 thread block tiles, it can execute one thread block per SM

is what’s confusing to me. If a single thread can process multiple elements, what makes 256x128 = 32768 elements per SM such a good fit? Is it that it fits perfectly into the register file? Is it that the shape splits nicely across the 4 tensor cores in the SM?

Yes, that is what I get by reading the document from the start, i.e. starting here. For further confirmation, we can observe that the document mentions tensor cores quite a bit and has deep learning in view. Even the smallest tensor-core (TC) op (m8n8k4) is processed by a warp (32 threads) but computes values associated with 64 output matrix elements. All TC ops are processed warp-wide; none of them have a 1:1 relationship (i.e. none are 32 elements : 32 threads). AFAIK they are all 64 elements : 32 threads or larger operations. So a 1:1 relationship between threads and elements is not in view here.
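To make the thread:element ratio concrete, here is a minimal WMMA sketch (my own, not from the document) in which a single 32-thread warp owns an entire 16x16 output tile, i.e. 256 output elements serviced by 32 threads:

```cpp
// Minimal WMMA sketch: fp16 inputs, fp32 accumulate, 16x16x16 fragments.
// Assumes one warp, A is 16xK row-major, B is Kx16 col-major, K a multiple of 16.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half *A, const half *B, float *C, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + k, K);     // leading dimension K (row-major A)
        wmma::load_matrix_sync(bFrag, B + k, K);     // leading dimension K (col-major B)
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // the entire warp participates
    }
    // One warp writes the full 16x16 = 256-element output tile.
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```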

In the context of matrix multiplication, these are not the only considerations. An important factor is to load each element from global memory as few times as possible. You can get an idea of the typical mechanism (shared memory) used to help with this by studying the tiled shared-memory matrix multiplication code in the programming guide.

The implication is that we may need storage for as much as 2x the output tile “space” in shared memory: one buffer for the A array and one for the B array. These are not necessarily all the same data types with TC ops (for example, we could have fp16 input but fp32 output), but the direction I’m headed is that shared memory can be a limiter to occupancy, just like threads can be.

So my guess would be that the product of 256x128, taking into account data type sizes (generally no larger than 2 bytes per element for the input matrices of typical DL-focused TC ops), and the need for one buffer per input array, results in a shared memory footprint that prevents occupancy of more than one threadblock on an A100 SM. A100 shared memory maxes out at around 164 KB/SM, and at two bytes per element per input matrix, an output matrix tile of 256x128 works out to a total of about 128 KB of shared memory. So you could only have one of those threadblocks resident on an SM at a time. That is probably the limiter in play here, not threads.
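Here is that back-of-the-envelope estimate as a tiny host-side program (my own sketch; the 2-byte element size and the "one tile-sized buffer per input" simplification are assumptions, not from the document):

```cpp
// Back-of-the-envelope shared memory estimate for one 256x128 output tile,
// assuming fp16 (2-byte) inputs and roughly a tile-sized staging buffer for A and B.
#include <cstdio>

int main() {
    const int tileM = 256, tileN = 128;
    const int bytesPerElem = 2;                               // fp16 input matrices
    size_t perInput  = (size_t)tileM * tileN * bytesPerElem;  // ~64 KB
    size_t total     = 2 * perInput;                          // A + B staging ~= 128 KB
    const size_t smemPerSM = 164 * 1024;                      // ~164 KB max per A100 SM

    printf("estimated smem per block: %zu KB, max per SM: %zu KB, blocks per SM: %zu\n",
           total / 1024, smemPerSM / 1024, smemPerSM / total);
    return 0;
}
```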

To a large extent, the document you and I linked has cublas in view. So you could get further support for this idea from the profiler. You would first need to construct a cublas call that does a TC matrix-matrix multiply, probably of a fairly large size and with the right data types. The profiler will then tell you the kernel grid dimensions as well as the shared memory used by each threadblock, and can even give you occupancy info.
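As a rough sketch of what that could look like (my own example, not from the document; the sizes and data types are just assumptions), something along these lines should land on a tensor-core GEMM kernel that Nsight Compute can then report grid dimensions, per-block shared memory, and occupancy for:

```cpp
// Hypothetical cuBLAS GEMM intended to hit a tensor-core kernel for profiling:
// fp16 inputs, fp32 compute/output, large dimensions. Link with -lcublas.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int M = 8192, N = 8192, K = 8192;   // assumed sizes, large enough to be interesting
    half *A, *B; float *C;
    cudaMalloc(&A, sizeof(half)  * M * K);
    cudaMalloc(&B, sizeof(half)  * K * N);
    cudaMalloc(&C, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Mixed-precision GEMM; the default math mode on A100 allows tensor cores.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M,
                         B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_32F, M,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```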
