I tested out MIG (Multi-Instance GPU) on Jetson Thor with the latest JetPack 7.2. Here are a few observations on what worked and what did not. Sharing them for the benefit of others:
What does MIG do?
- Thor has 20 SMs (streaming multiprocessors). MIG splits the SMs into isolated slices so that you can run dedicated workloads on each slice. Since Thor has unified memory, both workloads share the same memory.
- On JetPack 7.2, the slices are fixed: one slice with 8 SMs and another with 12 SMs.
- The workload I wanted to test: can I serve one inference model from one slice and a different one from the other? I wanted to check whether slicing helps with GPU contention, since memory itself isn't a problem on Thor but the single GPU is. (I know this could be different from the actual use case people might have for it.)
How to enable it?
- The documentation is pretty clear: you cannot enable MIG while the GPU is in use. So the first step is to stop gdm (the desktop UI), since it is using the GPU:
sudo systemctl stop gdm
- MIG resets on a reboot. If you want a permanent split, run the following command. If you are just testing in a single session, this is optional:
sudo nvidia-smi -pm 1 #optional
- Next, we actually carry out the split:
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 83,78 -C # ONLY these profiles work. Took this from release notes
nvidia-smi -L # This should show a split
- That’s it! Here is sample output showing how it looks on my Thor:
amar@localhost:~$ nvidia-smi -L
GPU 0: NVIDIA Thor (UUID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600)
MIG 2g.0gb Device 0: (UUID: MIG-c16cc329-4600-51d3-a578-e5c5bf35344e)
MIG 1g.0gb Device 1: (UUID: MIG-9f59caa3-c7a7-5270-b22e-d94cff3d9432)
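Since the next step needs the slice UUIDs, here is a minimal sketch for pulling them out of the `nvidia-smi -L` listing instead of copy-pasting. The function names `list_mig_slices` and `mig_uuid_for` are my own (hypothetical), and the parsing assumes the listing format matches the sample output above, so treat it as a starting point, not a guaranteed parser.

```shell
# Print "<profile> <uuid>" for every MIG device line in an `nvidia-smi -L` listing.
# Reads the listing from stdin, so it can be tried on saved output without a GPU.
list_mig_slices() {
  sed -n 's/^[[:space:]]*MIG \([^[:space:]]*\)[[:space:]].*(UUID: \(MIG-[^)]*\)).*$/\1 \2/p'
}

# Print the UUID of the first slice whose profile name starts with the
# given prefix, e.g. "1g" or "2g".
mig_uuid_for() {
  list_mig_slices | awk -v p="$1" 'index($1, p) == 1 { print $2; exit }'
}
```

Usage would be something like `nvidia-smi -L | mig_uuid_for 1g`, with the result passed on as the slice UUID.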
How to test it?
- Once you have the split, you can use CUDA_VISIBLE_DEVICES to specify which slice you want to target.
- I used llama.cpp as an inference engine since my cursory research showed it respects the env variable.
CUDA_VISIBLE_DEVICES=<1g-uuid> llama-server -m gemma-12b.gguf --port 8082 &
CUDA_VISIBLE_DEVICES=<2g-uuid> llama-server -m qwen-27b.gguf --port 8081 &
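The two launches above can be wrapped in a small script that always starts the smaller slice first, since in my testing starting the larger slice first left the smaller one hanging at CUDA init. This is only a sketch: `launch_on_slice`, `launch_both`, `DRY_RUN` and `LAUNCH_GAP` are hypothetical names, and the pause between launches is an arbitrary assumption on my part, not something from the release notes.

```shell
# Launch llama-server pinned to one MIG slice.
# $1 = MIG UUID, $2 = model file, $3 = port
# With DRY_RUN=1 the command is printed instead of executed.
launch_on_slice() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "CUDA_VISIBLE_DEVICES=$1 llama-server -m $2 --port $3"
  else
    CUDA_VISIBLE_DEVICES="$1" llama-server -m "$2" --port "$3" &
  fi
}

# $1 = UUID of the 1g (smaller) slice, $2 = UUID of the 2g (larger) slice
launch_both() {
  launch_on_slice "$1" gemma-12b.gguf 8082   # smaller slice first
  # Assumed pause so the first server can finish CUDA init; skipped in dry run.
  [ "${DRY_RUN:-0}" = "1" ] || sleep "${LAUNCH_GAP:-10}"
  launch_on_slice "$2" qwen-27b.gguf 8081    # larger slice second
}
```

Running it with `DRY_RUN=1` first prints the two commands in launch order, which is a cheap way to check that the UUIDs are not swapped.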
A few gotchas to keep in mind
- If you want your instances to persist, ensure you are running
sudo nvidia-smi -pm 1
- While launching workloads, ensure that the smaller slice’s workload is launched first, then the larger slice’s.
- If you start the larger slice first, I observed that the smaller slice hangs at CUDA init indefinitely. The reverse order works fine.
- nvidia-smi is not correct as of the R39.2 release notes. There is a known issue, 6162096: “Output of nvidia-smi is incorrect when using MIG.”
- This gives you incorrect information, showing the SM split as 12 and 6.
- Teardown (ordering matters):
# Stop everything running on the slices
pkill -f llama-server # or whatever holds contexts
# Destroy compute instances, then GPU instances (order matters)
sudo nvidia-smi mig -dci
sudo nvidia-smi mig -dgi
# Disable MIG mode → whole 20-SM GPU, takes effect immediately, no reboot
sudo nvidia-smi -i 0 -mig 0
# Bring the desktop back if you want it (nvargus-daemon only if you stopped it too)
sudo systemctl start nvargus-daemon gdm
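Because the order matters, the teardown can be chained so that a later step only runs if the earlier one succeeded. A minimal sketch, with hypothetical helper names `run` and `mig_teardown` and a `DRY_RUN` switch that I added for illustration:

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

mig_teardown() {
  # pkill exits non-zero when nothing matched, so do not treat that as fatal.
  run pkill -f llama-server || true
  # Compute instances first, then GPU instances, then MIG mode off.
  # The && chain stops at the first failing step.
  run sudo nvidia-smi mig -dci &&
    run sudo nvidia-smi mig -dgi &&
    run sudo nvidia-smi -i 0 -mig 0
}
```

`DRY_RUN=1` prints the commands in order without touching the GPU. Restarting the desktop is left out on purpose, so that can be done by hand afterwards.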
