Unable to destroy compute instances after running collocated jobs with MIG on H100

Hello!
After using PyTorch to train on two MIG compute instances at the same time, the compute instances can no longer be destroyed:

unable to destroy compute instance id 0 from gpu 0 gpu instance id 3: in use by another client
failed to destroy compute instances: in use by another client
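
For reference, the teardown I'm attempting looks roughly like this (a sketch; the instance IDs come from this particular run and will differ per configuration):

# list the current compute instances and their IDs
nvidia-smi mig -lci
# attempt to destroy all compute instances on GPU 0 - this is what fails
sudo nvidia-smi mig -dci -i 0
# destroying the GPU instances afterwards would then be
sudo nvidia-smi mig -dgi -i 0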

This issue doesn’t occur when I train on only one of the compute instances at a time.
The instance configuration did not influence the behaviour; in this particular example I used a 2g.20gb and a 3g.40gb instance.

Interestingly, lsof | grep /dev/nvidia does not yield any results. When the instances are stuck this way (they can’t be destroyed, but I can still run workloads on them), I cannot reset the GPU either.
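
To be explicit, the checks look roughly like this (a sketch; the fuser call is just an additional way to look for open device nodes on top of lsof):

# look for processes holding the NVIDIA device nodes open
lsof | grep /dev/nvidia
sudo fuser -v /dev/nvidia*
# the reset that fails while the instances are stuck
sudo nvidia-smi --gpu-reset -i 0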

I have not been able to find any process or handle that might be blocking the destruction of the compute instances. I originally raised this issue on the DCGM GitHub (H100 MIG Instances failed to destroy after usage · Issue #95 · NVIDIA/DCGM · GitHub), but we were unable to find the source.

Thank you for your time!

Hi Ties,

It does look like there is some outstanding user for the compute instances (possibly an internal one).

Could you run nvidia-bug-report.sh after you produce this state and attempt deletion? That will generate a log file, nvidia-bug-report.log.gz, which may contain relevant info.

In case that doesn’t get us closer to figuring out what’s wrong, are you able to share code you use to reproduce the state?
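
Roughly, something like this (run as root right after the failed deletion so the relevant state is captured):

# generates nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh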

Thank you for your response, @kpatelczyk. I’ve included the requested log as an attachment.
nvidia-bug-report.log.gz (790.0 KB)

Thanks Ties.

I do see in the log a failure to destroy a compute instance due to an outstanding reference to it. There is no indication of any error that would be causing bad refcounts, so going by the log it looks like there simply is a user for it. Not being able to reset in that state is expected - a reset cannot be performed while the GPU still has users.

I would expect that user to show up in lsof with an open device node for the GPU, though, which we didn’t see.

Could you try shutting down all the services that could be using the GPU before attempting the repro again? If you can also verify that no users exist by unloading the nvidia module before re-attempting, that would be great.

To unload the modules you can run rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia.
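
As a sketch, assuming a fairly standard setup (the service names below are examples - what actually needs stopping depends on what you have installed: nvidia-persistenced, DCGM, a display manager, container runtimes, etc.):

# stop services that commonly hold the GPU open (adjust names to your system)
sudo systemctl stop nvidia-persistenced
sudo systemctl stop nvidia-dcgm
# confirm nothing still has the device nodes open
sudo lsof /dev/nvidia*
# then try unloading the modules
sudo rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia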

Hey @kpatelczyk. Thank you for your help, and I am sorry for the delay.
I tried running your command, but it results in the following errors:

rmmod: ERROR: Module nvidia_uvm is in use
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm

Should I try to force-unload the module?

Best,

Force unloading is unlikely to resolve the issue. If there is still a user for the module, we might corrupt the system by doing that. It may give some hints in dmesg/syslog about who that user was, though, so it may be worth a shot.
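
If you do try it, something along these lines (a sketch; rmmod -f only works if the kernel was built with CONFIG_MODULE_FORCE_UNLOAD, and it can leave the system in a bad state, so only do this on a box you can reboot):

# see which modules reference nvidia and their use counts
lsmod | grep nvidia
# forced unload - risky, may require CONFIG_MODULE_FORCE_UNLOAD
sudo rmmod -f nvidia-uvm
# check the kernel log for hints about the outstanding user
dmesg | tail -n 50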

Also, after you induce the state, can you use the following to show the state of all tasks in the system: echo t > /proc/sysrq-trigger? The output goes to the system log.
If we are stuck somewhere in the UVM driver, this should give us some hints about that too.
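
Concretely, something like this (needs root; sysrq may need to be enabled first):

# make sure the sysrq interface is enabled
echo 1 | sudo tee /proc/sys/kernel/sysrq
# dump the state of all tasks to the kernel log
echo t | sudo tee /proc/sysrq-trigger
# read it back - look for tasks blocked in nvidia/uvm code paths
sudo dmesg | less
# or: journalctl -k -e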