For the purpose of this discussion, I would suggest we can focus on 3 top-level characteristics of a GPU to understand the effect of MIG:
- memory size
- memory bandwidth
- the number of SMs
MIG potentially divides all 3 of these. To a first order approximation, a MIG GPU with 7 instances will provide 1/7 of the memory, 1/7 of the memory bandwidth, and 1/7 of the SMs in the full GPU, to each instance.
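If you want to see that concretely, a MIG instance simply presents itself as a smaller GPU. Here is a minimal sketch (assuming pytorch, since that is what your code uses) of how a slice advertises its reduced resources:

```python
# Minimal sketch: on a MIG instance, the visible device reports the sliced-down
# resources, e.g. roughly 1/7 of the SMs and memory of a full A100 for a 7-way split.
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                   # e.g. "NVIDIA A100 ... MIG 1g.5gb"
print("SMs:", props.multi_processor_count)          # ~14 on a 1/7 slice vs ~108 on the full GPU
print("Memory (GiB):", props.total_memory / 2**30)  # roughly 5 GB on a 1g.5gb slice
```

(Memory bandwidth isn’t reported this way, but it is partitioned along with the memory.)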
Just because a particular code “fits” in 1/7 of the memory does not mean that the other factors can be ignored. Memory size, to a first order approximation, speaks to capability: whether the code can run at all. Because of the way a GPU works, the number of SMs and the amount of memory bandwidth speak to performance, not capability. A CUDA code runs (on any CUDA GPU) in such a way that the number of SMs and the amount of memory bandwidth do not dictate whether the code will run, only how fast it will run.
I haven’t studied it carefully, but your presented code appears to be a 3-layer neural net consisting of fully-connected layers. With a batch of 64 this is readily convertible (by cuDNN) into a sequence of matrix-multiply operations, to compute either the forward or backward path through the network. The largest matrix multiply appears to have a size on the order of 2048x64, but a modern realization in pytorch would seek to use Tensor Cores (TC) for this calculation.
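To make sure we’re picturing the same thing, here is roughly the kind of model I have in mind; the layer widths are my guesses based on the ~2048x64 figure, not taken from your code:

```python
# A rough stand-in for what I understand your model to be: a small stack of
# fully-connected layers. Layer widths here are guesses, not your actual code.
import torch
import torch.nn as nn

hidden, batch = 2048, 64

model = nn.Sequential(
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, 10),
).cuda()

x = torch.randn(batch, hidden, device="cuda")
y = model(x)   # each nn.Linear forward is effectively a (64 x 2048) @ (2048 x out) GEMM
```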
The Tensor Core (TC) units assigned to an SM do not handle particularly large matrix-multiply ops. Typically a single TC op works on matrices on the order of 16x16, and larger ops (such as 2048x64) will be broken down into a series of these smaller ops. So we could imagine a 2048x64 matrix-matrix multiply requiring many individual 16x16 TC ops in order to synthesize the overall result.
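Just to make that arithmetic concrete, here is the tile counting for a GEMM of roughly that shape (the inner dimension K is my assumption; I don’t know it from your code):

```python
# Back-of-envelope: how many 16x16x16 Tensor Core ops a GEMM of this rough shape
# decomposes into. M x N is the output size; K (the inner dimension) is assumed.
M, N, K = 2048, 64, 2048
tile = 16

output_tiles   = (M // tile) * (N // tile)   # 128 * 4 = 512 output tiles of 16x16
steps_per_tile = K // tile                   # 128 accumulation steps along K
print(output_tiles * steps_per_tile)         # 65536 individual 16x16x16 TC ops
```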
The number of TC units in an A100 is a function of how many SMs are available; TC units belong to SMs. So with more SMs, I have more TC units; with fewer SMs, I have fewer. The actual number per SM is on the order of 4, but don’t quote me on that; look it up in the A100 whitepaper if you need that level of detail.
Anyway, an A100 will have ~100 SMs total, so a 1/7 MIG slice will have around 14. Hopefully I don’t need to carry this discussion much further: for an operation that makes significant use of TC, unless it is very small (and 2048x64 doesn’t fit that description), having more TC units “available” will result in more performance.
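In round numbers (same caveats as above; the exact counts are in the A100 whitepaper):

```python
# Rough counting: Tensor Core units scale directly with the number of SMs.
sms_full  = 108   # full A100 (~100 in round numbers)
sms_slice = 14    # 1/7 MIG slice
tc_per_sm = 4     # approximate, per the caveat above

print("TC units, full GPU :", sms_full  * tc_per_sm)   # ~432
print("TC units, 1/7 slice:", sms_slice * tc_per_sm)   # ~56
```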
I believe that statement needs to be interpreted, and considered in context. If you buy my argument about how to think about GPU capability and performance in 3 top-level components, then we can immediately start qualifying that statement. A full A100 does not have 7x the memory bandwidth of a full V100; it’s actually much closer to 2x. So if the code you are running has a significant dependency on memory bandwidth, I personally would not assume that this statement will apply in every case. A code that is memory bound would operate at full speed on a full A100, at approximately 1/7 of that speed on a 1/7 A100 MIG slice, and at approximately half of that speed on a V100.
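A quick back-of-envelope using rough published bandwidth figures (~900 GB/s for V100, ~1.6 TB/s for a 40GB A100; the 80GB part is somewhat higher) shows how differently a bandwidth-bound code scales:

```python
# Rough scaling estimate for a purely memory-bound workload (bandwidth numbers approximate).
bw_v100      = 900e9        # B/s
bw_a100      = 1.6e12       # B/s, 40GB A100
bw_mig_slice = bw_a100 / 7  # 1/7 MIG slice

bytes_moved = 10e9          # some arbitrary memory-bound workload
for name, bw in [("V100", bw_v100), ("A100", bw_a100), ("1/7 A100 slice", bw_mig_slice)]:
    print(f"{name}: ~{bytes_moved / bw * 1e3:.1f} ms")
# Full A100 comes out roughly 2x faster than V100, and the slice roughly 7x slower
# than the full A100 -- nowhere near "a slice equals a V100" for this kind of code.
```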
The statement should be interpreted in the context of the code they are studying. The code they are studying is BERT, and for a large BERT case there will be a strong component of matrix-multiply operations. In this case, the relevant comparison is peak Tensor Core throughput: on the order of 300-1200 TFLOPS/TOPS for a full A100 versus 125 TFLOPS for a V100. We are now starting to see a ratio that could be on the order of 7x for the BERT inference, depending on exactly which number format and figure you use. If it is 7x, then 1/7 of a full A100 is going to give approximately the same TC throughput for that type of workload as a full V100.
That is the way I would interpret that statement.
Regarding your actual code, I can’t say for sure what its actual performance dependencies are without measuring it with a profiler. It might be that TC ops are not used (for instance, it’s not obvious to me that you are using tensor formats amenable to mixed-precision work, though the optimizer may be doing that, and I haven’t studied your code carefully), but that doesn’t radically alter the argument I have given here. The workload (2048x64), if decomposed to threads (e.g. for an “ordinary” matrix multiply), is certainly enough to fill more than 1/7 of an A100 GPU (i.e. the SMs’ thread capacity), so running that sort of workload with reduced SMs is certainly going to affect performance, even there.
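If you want a first pass before reaching for Nsight Compute, the built-in pytorch profiler will at least show which kernels dominate; this is a generic sketch, not your code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in layer, FP16 so Tensor-Core-eligible GEMM kernels can be selected.
layer = torch.nn.Linear(2048, 2048).cuda().half()
x = torch.randn(64, 2048, device="cuda", dtype=torch.half)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    y = layer(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# GEMM kernel names containing things like "hmma" or "16816" generally indicate
# Tensor Core use; plain "sgemm" kernels indicate ordinary FP32 math.
```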
I would say that the picture you posted here makes exactly that claim.
Let’s not mix training and inference. Inference (especially when trying to make comparisons like this, to show MIG in the most favorable light) will probably use a number format not typically used in training. On V100, roughly speaking, the only TC path is FP16, and with that you get a peak theoretical throughput of 125 TF. On A100 there are more possible TC paths. With FP16, you get 300 TF (non-sparse) on A100; presumably that could be a “suitable” path for either training or inference. But if you focus only on inference, INT8 is an available path, and it gives 600 TOPS (non-sparse). So 600 vs 125 is on the order of 5x, not 2x or 3x, and is getting closer to supporting the BERT inference claim. But INT8 is typically not used for training, so the training comparison you excerpted looks different.
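Putting those rough peak numbers side by side (treat them as approximate):

```python
# Rough peak TC throughputs quoted above (non-sparse figures).
v100_fp16_tf   = 125   # V100 FP16 Tensor Core peak, TFLOPS
a100_fp16_tf   = 300   # A100 FP16 Tensor Core peak, TFLOPS
a100_int8_tops = 600   # A100 INT8 Tensor Core peak, TOPS (inference-oriented)

print("A100 FP16 / V100 FP16:", a100_fp16_tf / v100_fp16_tf)    # ~2.4x
print("A100 INT8 / V100 FP16:", a100_int8_tops / v100_fp16_tf)  # ~4.8x
# Adding sparsity or other inference-only tricks pushes the A100 number higher still,
# which is how you approach the ~7x figure behind the "1/7 slice ~ V100" inference claim.
```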