GPU: A100-PCIE-40GB
Driver Version: 570.86.15
CUDA Version: 12.5
Steps to Reproduce:
- Partition the GPU into 3g.20gb and 4g.20gb MIG instances. The 3g.20gb instance has GI ID 2 and the 4g.20gb instance has GI ID 1.
  $ sudo nvidia-smi mig -i 0 -cgi 3g.20gb,4g.20gb -C
- Run BlackScholes on both instances. I modified the constant NUM_ITERATIONS from 512 to 5120000 to make it a long-running CUDA application.
- Stop the process on the 4g.20gb instance, keeping the 3g.20gb instance busy.
- Attempt to destroy the 4g.20gb instance. This step is expected to succeed because the 4g.20gb instance is now idle (see the checks sketched after this list).
  $ sudo nvidia-smi mig -i 0 -dci -ci 0 -gi 1    # succeeds
  Successfully destroyed compute instance ID 0 from GPU 0 GPU instance ID 1
  $ sudo nvidia-smi mig -i 0 -dgi -gi 1          # fails
  Unable to destroy GPU instance ID 1 from GPU 0: In use by another client
  Failed to destroy GPU instances: In use by another client
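For reference, a minimal sketch of how the two runs can be pinned to the instances and how the idle state can be double-checked before destroying anything. The MIG UUID placeholders are illustrative and come from the `nvidia-smi -L` output on the machine in question; the BlackScholes binary is the one from the CUDA samples:
  $ nvidia-smi -L                                              # list the MIG device UUIDs
  $ CUDA_VISIBLE_DEVICES=MIG-<3g.20gb-uuid> ./BlackScholes &   # pin one run to the 3g.20gb instance
  $ CUDA_VISIBLE_DEVICES=MIG-<4g.20gb-uuid> ./BlackScholes &   # pin the other run to the 4g.20gb instance
  $ nvidia-smi mig -i 0 -lci                                   # after killing the 4g.20gb run, no compute instance should remain under GI 1
  $ nvidia-smi                                                 # the process table should show nothing running on the 4g.20gb instance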
Questions:
- Why does the 4g.20gb instance remain “in use” even after stopping all processes?
- How can I destroy the 4g.20gb instance correctly?
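In case it helps, a sketch of checks that might identify the “other client”, assuming standard Linux tools (fuser/ps) are available; the capability node names under /dev/nvidia-caps vary between systems, and the daemons listed are only common candidates, not confirmed culprits here:
  $ sudo fuser -v /dev/nvidia*                                   # processes holding the plain device nodes open
  $ sudo fuser -v /dev/nvidia-caps/*                             # processes holding the MIG capability nodes open
  $ ps aux | grep -E 'nvidia-persistenced|dcgm|nv-hostengine'    # monitoring/persistence daemons that sometimes count as another client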