Unable to destroy compute instances after running collocated jobs with MIG on H100

Hello!
After using PyTorch to train on two MIG compute instances at the same time, the compute instances can no longer be destroyed:

unable to destroy compute instance id 0 from gpu 0 gpu instance id 3: in use by another client
failed to destroy compute instances: in use by another client
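
For reference, the teardown I'm attempting looks roughly like this (a sketch; the instance IDs come from this particular run and will differ per configuration):

# list the current compute instances and their IDs
nvidia-smi mig -lci
# attempt to destroy all compute instances on GPU 0 - this is what fails
sudo nvidia-smi mig -dci -i 0
# destroying the GPU instances afterwards would then be
sudo nvidia-smi mig -dgi -i 0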

This issue doesn’t occur when I train on only one of the compute instances at a time.
The instance configuration did not influence the behaviour; in this particular example I used a 2g.20gb and a 3g.40gb instance.

Interestingly, lsof | grep /dev/nvidia does not yield any results. When the instances are stuck this way (they can’t be destroyed, but I can still run workloads on them), I cannot reset the GPU either.
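
To be explicit, the checks look roughly like this (a sketch; the fuser call is just an additional way to look for open device nodes on top of lsof):

# look for processes holding the NVIDIA device nodes open
lsof | grep /dev/nvidia
sudo fuser -v /dev/nvidia*
# the reset that fails while the instances are stuck
sudo nvidia-smi --gpu-reset -i 0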

I have not been able to find any process or handle that might be blocking the destruction of the compute instances. I originally raised this issue on the DCGM GitHub (H100 MIG Instances failed to destroy after usage · Issue #95 · NVIDIA/DCGM · GitHub), but we were unable to find the source.

Thank you for your time!

Hi Ties,

It does look like there is some outstanding user for the compute instances (possibly an internal one).

Could you run nvidia-bug-report.sh after you produce this state and attempt deletion? That will generate a log file, nvidia-bug-report.log.gz, which may contain relevant info.

In case that doesn’t get us closer to figuring out what’s wrong, are you able to share code you use to reproduce the state?
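
Roughly, something like this (run as root right after the failed deletion so the relevant state is captured):

# generates nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh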

Thank you for your response, @kpatelczyk. I’ve included the requested log as an attachment.
nvidia-bug-report.log.gz (790.0 KB)

Thanks Ties.

I do see in the log a failure to destroy a compute instance due to an outstanding reference to it. There is no indication of any error that would be causing bad refcounts, so going by the log it looks like there simply is a user for it. Not being able to reset in that state is expected - a reset cannot be performed while the GPU still has users.

I would expect that user to show up in lsof with an open device node for the GPU, though, which we didn’t see.

Could you try shutting down all the services that could be using the GPU before attempting the repro again? If you can also verify that no users exist by unloading the nvidia module before re-attempting, that would be great.

To unload the modules you can run rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia.
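
As a sketch, assuming a fairly standard setup (the service names below are examples - what actually needs stopping depends on what you have installed: nvidia-persistenced, DCGM, a display manager, container runtimes, etc.):

# stop services that commonly hold the GPU open (adjust names to your system)
sudo systemctl stop nvidia-persistenced
sudo systemctl stop nvidia-dcgm
# confirm nothing still has the device nodes open
sudo lsof /dev/nvidia*
# then try unloading the modules
sudo rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia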

Hey @kpatelczyk. Thank you for your help, and I am sorry for the delay.
I tried running your command, but it results in the following errors:

rmmod: ERROR: Module nvidia_uvm is in use
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm

Should I try to force-unload the module?

Best,

Force unloading is unlikely to resolve the issue. If there is still a user for the module, we might corrupt the system by doing that. It may give some hints in dmesg/syslog about who that user was, though, so it may be worth a shot.
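
If you do try it, something along these lines (a sketch; rmmod -f only works if the kernel was built with CONFIG_MODULE_FORCE_UNLOAD, and it can leave the system in a bad state, so only do this on a box you can reboot):

# see which modules reference nvidia and their use counts
lsmod | grep nvidia
# forced unload - risky, may require CONFIG_MODULE_FORCE_UNLOAD
sudo rmmod -f nvidia-uvm
# check the kernel log for hints about the outstanding user
dmesg | tail -n 50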

Also, after you induce the state, can you use the following to show the state of all tasks in the system: echo t > /proc/sysrq-trigger? The output goes to the system log.
If we are stuck somewhere in the UVM driver, this should give us some hints about that too.
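
Concretely, something like this (needs root; sysrq may need to be enabled first):

# make sure the sysrq interface is enabled
echo 1 | sudo tee /proc/sys/kernel/sysrq
# dump the state of all tasks to the kernel log
echo t | sudo tee /proc/sysrq-trigger
# read it back - look for tasks blocked in nvidia/uvm code paths
sudo dmesg | less
# or: journalctl -k -e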