Hello,
I am currently exploring how to efficiently use MIG in a Kubernetes environment, and I have some questions about MIG instance reconfiguration.
According to the following documentation from NVIDIA:
NVIDIA MIG Manager For Kubernetes | NVIDIA NGC,
it appears that when using the MIG Manager, all workloads on a GPU must be stopped before the MIG configuration can be changed.
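For context, my understanding is that MIG Manager works declaratively: you give it a target layout for the whole GPU and it reapplies it, which is presumably why all workloads must be drained first. A sketch of what I believe such a config looks like (based on the mig-parted config format; the config name `mixed-profile` and device index are my own placeholders):

```yaml
# Assumed mig-parted / MIG Manager config shape (not copied from any
# official file): it describes the desired end state of GPU 0, not the
# individual create/delete steps needed to reach it.
version: v1
mig-configs:
  mixed-profile:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "4g.20gb": 1
        "2g.10gb": 1
```

If applying this really means tearing down and rebuilding the whole layout, that would explain the requirement to stop all workloads on the GPU first.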
I’m wondering how this differs from an alternative method where, instead of using MIG Manager, we manually delete idle MIG instances and create new ones. For example, suppose I have a configuration with 4g.20gb + 2g.10gb instances, and the 4g.20gb instance is actively running a workload. If I want to reconfigure to 4g.20gb + 1g.5gb + 1g.5gb, it seems that using MIG Manager would require terminating the workload on the 4g.20gb instance.
However, if I were to simply delete the idle 2g.10gb instance and manually create two 1g.5gb instances instead, would this cause any issues? Does this approach avoid the need to stop the running job?
Also, I’ve noticed that the Dynamic MIG feature mentioned in the NVIDIA Run:ai documentation (Version 2.19 -) has been deprecated. Since it seemed like a useful capability, I’d also like to understand why it was deprecated.
Thank you!