MIG performance

Hi,

I am currently testing the performance of an MIG-enabled A100 compared to a full A100 using a small neural network training benchmark that I expected to yield similar results. However, I am observing a significant difference in training speed, with the MIG-enabled A100 being approximately five times slower.

Here are the results:

  • Full A100 (slurm_benchmark-nn-training-a100_5044818.out):
    Training time: 0.00533 seconds per iteration
  • MIG-Enabled A100 (slurm_benchmark-nn-training-a100small_5044817.out):
    Training time: 0.02640 seconds per iteration

I would appreciate any insights into why this discrepancy might be occurring. Are MIG instances generally this much slower when the GPU is split into, say, 7 instances? Or would that point to a configuration issue?

Thank you!

import torch
import torch.nn as nn
import torch.optim as optim
import time

# Simple feedforward neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(1024, 2048)
        self.fc2 = nn.Linear(2048, 1024)
        self.fc3 = nn.Linear(1024, 512)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize model, data, and optimizer
model = SimpleNet().to(device)
data = torch.randn(64, 1024, device=device)  # Batch of 64
target = torch.randn(64, 512, device=device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

# Warm-up
for _ in range(10):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

# Benchmark training
num_iters = 1000
torch.cuda.synchronize()  # make sure warm-up work has finished before starting the timer
start = time.time()
for _ in range(num_iters):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()  # wait for queued GPU work to finish before stopping the timer
end = time.time()

print(f"Training time: {(end - start) / num_iters:.5f} seconds per iteration")

Yes, they are. You have 1/7 of the original capability of the GPU in a single MIG instance.

I thought this would only impact the bandwidth and not the throughput. Do you have an idea why the throughput is reduced, even if no one else is using the other 6/7ths of the GPU?

From the horse’s mouth:

MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This gives administrators the ability to support every workload … with guaranteed quality of service (QoS)

Note “fully isolated”, “compute cores”, and “QoS”. MIG slices up a GPU; a user then gets to use such a slice, regardless of what else is going on in the system.

If I had to speculate: partitioning a GPU into fixed-size slices is an order of magnitude simpler than providing a dynamically shared resource. One guiding principle of GPU design is to minimize the resources spent on control mechanisms in favor of using resources for mechanisms that directly facilitate computation and data movement. In addition, guaranteeing a specified QoS in a dynamically shared resource is a hard problem, as I learned during my time working on networking products (CMTS; specialized edge router).

That is what puzzles me. The workload that I showed fits easily into a single slice. So why would it run significantly slower in a slice than on a full GPU? Does this mean that even a neural network this small benefits from more streaming multiprocessors, provided the clock frequency is not affected?

If we partition a GPU into two identical slices, each slice comprises half of the GPU resources. With half of the resources, we would expect that the amount of work that can be completed by a single slice per unit time is half of what the full GPU could complete.

In other words, consider the GPU as a team of 10 ditch diggers, each with their own shovel and one pick ax per pair of diggers, i.e. a total of ten shovels and five pick axes. The entire team can dig 100 feet of trench per hour.

If we now use MIG to partition our digging team into five equal team-slices (each with 2 diggers, 2 shovels, 1 pick ax), and let users rent such a team slice, how many feet of trench would we expect such a team-slice to dig per hour?

That picture is a bit misleading. The workload that I showed uses only a fraction of an A100. The workload fits, memory-wise, into a single slice. From your team of 10 diggers, 8 or 9 are just standing around doing nothing?

Let's say we have a team of 7 diggers to better match the architecture of the GPU. If I am running a workload that can be handled by one digger while the other six are just watching, why would that team outperform a single guy digging? Are they taking turns? Is the GPU able to use the memory more efficiently when doing forward and backward passes?

If you rent one team slice (2 diggers, 2 shovels, 1 pick ax), not only is it none of your business what the other 8 diggers are doing, you have no way of knowing. And your team-slice has no access to the shovels and pick axes of those diggers either. That is what “fully isolated” means.

The one team-slice you rented will dig 20 feet of trench per hour for you: that is the “guaranteed quality of service” you contracted and paid for.

You did not understand. I meant the case where you run the workload on the full GPU. The workload is not big enough to keep the GPU busy.

That is entirely possible. That could be an issue with me being provided less information than is available to you, or it could be an issue with me not correctly processing the information that was provided. Right now I am not sure which scenario applies here.

That presumably explains why you are seeing only a 5x slowdown rather than the theoretically expected 7x slowdown (full machine vs. 1/7 slice).
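
As a sanity check, here is the arithmetic on the two per-iteration times posted above (just a rough Python calculation):

full_a100 = 0.00533   # s/iteration, full A100 (posted above)
mig_slice = 0.02640   # s/iteration, 1/7 MIG slice (posted above)

# Observed slowdown vs. the first-order 1/7-resources expectation.
print(f"observed:    {mig_slice / full_a100:.2f}x")   # ~4.95x
print(f"theoretical: 7.00x")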

In this video they say one slice of an A100 is comparable to one full V100.

However, with my simple workload I get:

Full A100: Training time: 0.00533 seconds per iteration
Full V100: Training time: 0.00843 seconds per iteration
A100slice: Training time: 0.02640 seconds per iteration

Is your workload exactly the same workload they are running in the video? I am not going to watch the video, but presumably their claim is “on this one particular workload, we observe that one 1/7 slice of the A100 provides the same performance as a full V100”.

Which is certainly not generalizable to “one 1/7 slice of the A100 provides the same performance as the full V100 for any workload you care to throw at it”. Different workloads will scale differently across GPU architectures, sometimes significantly so.

I can try to run BERT as they did in the video. Sure.

That would mean doing BERT inference on a full A100 would be around 7 times faster than on a V100? Is that roughly the case?

A100 vs V100 language model training speed, PyTorch

Looks more like 3 times. Well, the same order of magnitude as 7 times.

With MIG you are dividing lots of resources: not only the compute resources, but also the memory resources, including memory bandwidth and cache size.

A small kernel running on only a single SM of a full V100 can still use the entire cache and the full memory bandwidth; with MIG this is not possible (on purpose).

For the purpose of this discussion, I would suggest we can focus on 3 top-level characteristics of a GPU to understand the effect of MIG:

  • memory size
  • memory bandwidth
  • the number of SMs

MIG potentially divides all 3 of these. To a first order approximation, a MIG GPU with 7 instances will provide 1/7 of the memory, 1/7 of the memory bandwidth, and 1/7 of the SMs in the full GPU, to each instance.

Just because a particular code “fits” in 1/7 of the memory does not mean that the other factors can be ignored. In fact, whereas memory size, to a first order approximation, speaks to capability, the number of SMs and the amount of memory bandwidth speak to performance, not capability, because of the way a GPU works. A CUDA code runs (on any CUDA GPU) in such a way that the number of SMs and the amount of memory bandwidth do not dictate whether the code will run or not.
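
If you want to confirm what your MIG instance actually exposes to a process, here is a minimal sketch (assuming PyTorch; on a full A100 you would typically see 108 SMs and 40 or 80 GB, and roughly 1/7 of each on a 1/7 slice):

import torch

# Report the SM count and memory visible to the current process.
props = torch.cuda.get_device_properties(0)
print(f"device name:        {props.name}")
print(f"SM count:           {props.multi_processor_count}")
print(f"total memory (GiB): {props.total_memory / 1024**3:.1f}")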

I haven’t studied it carefully, but your presented code appears to be a 3-layer neural net consisting of fully-connected layers. With a batch of 64 this is readily convertible (by CUDNN) into a sequence of matrix-multiply operations, to compute either the forward or the backward pass through the network. The largest matrix multiply appears to have a size on the order of 2048x64, but a modern realization in PyTorch would seek to use Tensor Cores (TC) for this calculation.

The Tensor Core units assigned to an SM do not handle particularly large matrix-multiply ops. Typically a matrix-multiply op handled by a Tensor Core will be on the order of 16x16, and larger ops (such as 2048x64) will be broken down into a series of smaller ops. So we could imagine a 2048x64 matrix-matrix multiply requiring many individual 16x16 TC ops in order to synthesize the overall result.
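
To make that concrete, here is the rough tile count for one of the GEMMs implied by the posted network (purely illustrative arithmetic; real kernels use different fragment shapes and blocking):

import math

# fc1 forward pass: (64 x 1024) activations times (1024 x 2048) weights.
M, K, N = 64, 1024, 2048
tile = 16  # nominal TC fragment edge, for illustration only

tiles = math.ceil(M / tile) * math.ceil(K / tile) * math.ceil(N / tile)
print(f"~{tiles} 16x16x16 fragments for one {M}x{K}x{N} GEMM")  # ~32768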

The number of TC units in an A100 is a function of how many SMs are available; TC units belong to SMs. So with more SMs, I have more TC units; with fewer SMs, fewer. The actual number per SM is on the order of 4, but don’t quote me on that; look it up in the A100 whitepaper if you need that level of detail.

Anyway, an A100 will have ~100 SMs total, so a 1/7 MIG slice will have around 14. Hopefully I don’t need to carry this discussion much further: you should be able to determine that for an operation that makes significant use of TC, unless it is very small (and 2048x64 doesn’t fit that description), having more TC units “available” will result in more performance.

I believe that statement needs to be interpreted, and considered in context. If you buy my argument about thinking of GPU capability and performance in terms of 3 top-level components, then we can immediately start qualifying that statement. A full A100 does not have 7x the memory bandwidth of a full V100; it is actually much closer to 2x. So if the code you are running has a significant dependency on memory bandwidth, I personally would not assume that this statement will apply in every case. A code that is memory bound would operate at full speed on a full A100, at approximately 1/7 speed on a 1/7 A100 MIG slice, and at approximately half speed on a V100.
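
A rough sketch of that bandwidth argument, using ballpark peak-bandwidth figures from the datasheets (approximate values, not measurements):

# Approximate peak memory bandwidth in GB/s (ballpark datasheet figures).
bw_a100_full = 1555                 # A100 40GB
bw_v100      = 900                  # V100
bw_a100_mig  = bw_a100_full / 7     # first-order estimate for a 1/7 slice

# Relative time for a purely bandwidth-bound workload (full A100 = 1x).
for name, bw in [("full A100", bw_a100_full),
                 ("V100", bw_v100),
                 ("1/7 A100 slice", bw_a100_mig)]:
    print(f"{name:>15}: {bw_a100_full / bw:.1f}x")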

The statement should be interpreted in the context of the code they are studying. The code they are studying is BERT, and for a large BERT case there will be a strong component of matrix-multiply operations. In this case the relevant ratio is on the order of 300-1200 TFLOPS/TOPS for the A100 versus 125 TFLOPS for the V100. We are now starting to see a ratio that could be on the order of 7x for the BERT inference, depending on exactly which number format and figure you use. If it is 7x, then 1/7 of a full A100 is going to give approximately the same TC throughput for that type of workload as a full V100.

That is the way I would interpret that statement.

Regarding your actual code, I can’t say for sure what its actual performance dependencies are without measuring it with a profiler. It might be that TC ops are not used (for instance, it’s not obvious to me that you are using tensor formats amenable to mixed-precision work, although the optimizer may be doing that; I haven’t studied your code carefully), but that doesn’t radically alter the argument I have given here. The workload (2048x64), if decomposed into threads (e.g. for an “ordinary” matrix multiply), is certainly enough to fill more than 1/7 of an A100 GPU (i.e. the SMs’ thread capacity), so running that sort of workload with reduced SMs is certainly going to affect the performance even there.
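
If you want to check whether an explicit mixed-precision path changes the picture, here is a minimal sketch of the benchmark loop with autocast, appended to the script above (it reuses model, data, target, criterion, and optimizer; whether TC kernels are actually hit would still need to be confirmed with a profiler):

import time
import torch

# Mixed-precision variant of the benchmark loop: autocast runs eligible ops
# in FP16 (the format the TC path expects); GradScaler guards the gradients.
scaler = torch.cuda.amp.GradScaler()

torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
torch.cuda.synchronize()
end = time.time()
print(f"AMP training time: {(end - start) / 1000:.5f} seconds per iteration")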

I would say that the picture you posted here makes exactly that claim.

Let’s not mix training and inference. Inference (especially when trying to make comparisons like this, to show MIG in the most favorable light) will probably use a number format not typically used in training. On V100, roughly speaking, the only TC path is FP16, and with that you get a peak theoretical throughput of 125TF. On A100 there are more possible TC paths. With FP16, you get 300TF (non-sparse) on A100. Presumably that could be a “suitable” path for either training or inference. But if you focus only on inference, INT8 is an available path, and it gives 600TF (non-sparse). So 600TF vs 125TF is on the order of 5x, not 2x or 3x, and is getting closer to supporting the BERT inference claim. But INT8 is typically not used for training, so the training comparison you excerpted looks different.
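
Putting the peak figures quoted above side by side (numbers as given in this thread, dense/non-sparse):

# Peak TC throughput figures quoted above (dense, no sparsity).
v100_fp16 = 125   # V100 FP16 tensor core, TFLOPS
a100_fp16 = 300   # A100 FP16 tensor core, TFLOPS
a100_int8 = 600   # A100 INT8 tensor core, TOPS

print(f"A100 FP16 vs V100 FP16: {a100_fp16 / v100_fp16:.1f}x")  # ~2.4x
print(f"A100 INT8 vs V100 FP16: {a100_int8 / v100_fp16:.1f}x")  # ~4.8x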