Issue with Destroying Idle MIG Instance After Previous Use on A100

GPU: A100-PCIE-40GB
Driver Version: 570.86.15
CUDA Version: 12.5

Steps to Reproduce:

  1. Partition the GPU into one 3g.20gb and one 4g.20gb MIG instance. The 3g.20gb instance is assigned GPU instance (GI) ID 2 and the 4g.20gb instance GI ID 1.
    $ sudo nvidia-smi mig -i 0 -cgi 3g.20gb,4g.20gb -C
    
  2. Run the CUDA BlackScholes sample on both instances. I changed the constant NUM_ITERATIONS from 512 to 5120000 to make it a long-running CUDA application.
  3. Stop the process on the 4g.20gb instance while keeping the 3g.20gb instance busy.
  4. Attempt to destroy the 4g.20gb instance. This step is expected to succeed because the instance is now idle.
    $ sudo nvidia-smi mig -i 0 -dci -ci 0 -gi 1  # succeeds
    Successfully destroyed compute instance ID  0 from GPU  0 GPU instance ID  1
    $ sudo nvidia-smi mig -i 0 -dgi -gi 1  # fails
    Unable to destroy GPU instance ID 1 from GPU 0: In use by another client
    Failed to destroy GPU instances: In use by another client
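
For anyone reproducing step 2: a CUDA process can be pinned to a single MIG instance via its UUID. A minimal sketch (the `MIG-xxxx...` value is a placeholder for the UUID that `nvidia-smi -L` prints on the actual system):

```shell
# List the GPU and its MIG devices with their UUIDs
nvidia-smi -L

# Pin the sample to one instance by exporting its MIG UUID
# (MIG-xxxx... below is a placeholder, not a real UUID)
CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ./BlackScholes
```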
    

Question:

  • Why does the 4g.20gb instance remain “in use” even after all processes on it have been stopped?
  • How can the 4g.20gb instance be destroyed correctly?
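
In case it helps with diagnosis, one way to check which clients still hold the GPU device nodes after step 3 (a sketch using standard tools; the output depends on what is running on the system):

```shell
# Any process with an open handle on the NVIDIA device nodes shows up here
# (fuser is from the psmisc package)
sudo fuser -v /dev/nvidia*

# Cross-check against nvidia-smi's own view of running compute processes
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```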