Multi-Instance GPU (MIG) mode and performance

I am experimenting with MIG mode on an A100 40 GB GPU.
GI profile: MIG 3g.20gb (profile ID 9)
Tested with a BERT base model running on TensorRT.

I am observing a considerable increase in latency.

Is an increase in latency expected in MIG mode?
Are there any suggestions or best practices for using MIG mode?

BERT may be complex enough to saturate a full A100 (without MIG). If that is the case, then moving inference to a MIG instance that is roughly half of an A100 (3g.20gb provides 3 of the 7 compute slices and half of the memory) could result in longer processing time and therefore higher latency.

Using MIG does not by itself add latency. But if the MIG instance you select cannot process the inference request in the same amount of time as the full GPU, then latency will increase.
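To make the reasoning above concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a measurement or an NVIDIA formula: it assumes latency stays flat while the workload fits within the instance's share of compute, and grows in proportion once it does not. The function name and the `full_gpu_utilization` input are hypothetical, and real behavior also depends on memory bandwidth, cache, and kernel occupancy.

```python
def estimated_mig_latency(full_gpu_latency_ms, full_gpu_utilization, compute_fraction):
    """Toy estimate of latency on a MIG instance (an assumption, not a measured law).

    full_gpu_latency_ms:  latency observed on the full GPU.
    full_gpu_utilization: share of the full GPU's compute the workload keeps busy (0..1).
    compute_fraction:     share of the GPU's compute in the instance,
                          e.g. 3/7 for 3g.20gb on an A100.
    """
    if not 0.0 < compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in (0, 1]")
    if not 0.0 <= full_gpu_utilization <= 1.0:
        raise ValueError("full_gpu_utilization must be in [0, 1]")
    # No slowdown while the workload fits in the instance; proportional slowdown beyond that.
    slowdown = max(1.0, full_gpu_utilization / compute_fraction)
    return full_gpu_latency_ms * slowdown


# A light workload (20% of the full GPU) fits in a 3g.20gb instance: no change.
light = estimated_mig_latency(10.0, 0.2, 3 / 7)
# A workload that saturates the full GPU slows down by about 7/3 in this model.
heavy = estimated_mig_latency(10.0, 1.0, 3 / 7)
```

Under these assumptions, a model that already saturates the full A100 would take a bit more than twice as long on 3g.20gb, while a small model would see no change.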

For example, I would expect very little latency difference between a single RN50 (batch size 1) inference on a “full” A100 and on a MIG “instance” of an A100. But for other more complex models there may be differences.
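One way to check this for your own model is to time the same inference call on the full GPU and on the MIG instance and compare. The sketch below is a generic timing harness in plain Python; `infer` is a hypothetical zero-argument callable that you would write to wrap your TensorRT execution, and it must block until the GPU work has finished (otherwise only the launch time is measured).

```python
import statistics
import time


def measure_latency(infer, warmup=10, iters=100):
    """Time repeated calls to `infer` and return summary statistics in milliseconds."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    # Warm-up runs are not timed, so one-time setup costs do not skew the results.
    for _ in range(warmup):
        infer()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the last sample.
    p99_index = min(len(samples_ms) - 1, int(0.99 * len(samples_ms)))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
    }
```

To target a specific MIG instance, the process can be started with CUDA_VISIBLE_DEVICES set to the MIG device UUID listed by `nvidia-smi -L`. Running the harness once that way and once on the GPU with MIG disabled gives two sets of numbers to compare at the same batch size.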

There is a MIG user guide available. Detailed TRT questions should be asked on the TRT forum.

You may also wish to review this for best practices, which will require sign-up/log-in.
