Nvidia-fabricmanager running error with "NV_WARN_NOTHING_TO_DO"

tg16 · November 15, 2023, 6:01am

I bought one NVLink device to connect two NVIDIA A40 graphics cards on an Ubuntu 20.04 system. I downloaded and installed the driver nvidia-driver-local-repo-ubuntu2004-515.105.01_1.0-1_amd64.deb from the official website, then installed CUDA 11.8 (cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb), also from the official website. After installing nvidia-fabricmanager-520_520.61.05-1_amd64.deb and nvidia-fabricmanager-dev-520_520.61.05-1_amd64.deb, I started the fabricmanager service (sudo systemctl start nvidia-fabricmanager), and it reported the following error: NV_WARN_NOTHING_TO_DO
Please help me fix this error and start the nvidia-fabricmanager service correctly. Thanks!
NVlinkError-fabricmanager-en1.docx (315.6 KB)

Robert_Crovella · November 15, 2023, 2:46pm

Fabric manager is not needed and should not be installed on a system that uses NVLink bridges. It is intended for use with systems that have NVSwitch functionality. A system with A40 GPUs, regardless of configuration or anything you do, does not have or use NVSwitch-es.

tg16 · November 16, 2023, 9:18am

Thank you! But when I run a 70B LLM in int8, which needs about 70 GB of video memory, the system loads about 46 GB onto one A40 while the other A40 is never used, and the following error is reported:

torch.empty(self.output_size,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 0; 44.42 GiB total capacity; 43.11 GiB already allocated; 426.81 MiB free; 43.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Please help me!

Robert_Crovella · November 16, 2023, 4:14pm

Nvlink (or fabric manager) does not mean that when you attempt to use one GPU it will automatically spill over to the other GPU, nor does it mean that your A40 has twice the memory.
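The numbers in the error message can be checked with some quick arithmetic. The sketch below is only a rough estimate under stated assumptions: 70e9 parameters stored at 1 byte each (int8), weights only, with no allowance for activations or cache. The 44.42 GiB figure is the per-GPU capacity PyTorch reported in the error above; the function name `weights_gib` is made up for illustration.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


# Assumed figures: 70e9 parameters at 1 byte each (int8).
needed = weights_gib(70e9, 1)
# Per-GPU capacity as reported in the OutOfMemoryError message.
per_gpu = 44.42

print(f"weights: {needed:.1f} GiB")
print(f"fits on one GPU:  {needed <= per_gpu}")
print(f"fits on two GPUs: {needed <= 2 * per_gpu}")
```

Under these assumptions the weights alone come to roughly 65 GiB, which exceeds one A40 but not two, so the model has to be split across both GPUs explicitly by the framework.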

Your question is fundamentally about how to use pytorch with multiple GPUs. I suggest asking on pytorch forum (discuss.pytorch.org). This is not the right forum for that, and I won’t be able to help further with that.

tg16 · November 17, 2023, 4:14am

Thank you! I have asked this question on the PyTorch forum at this link: https://discuss.pytorch.org/t/2-nvidia-a40-but-show-error-torch-cuda-outofmemory-error/192037

tg16 · November 19, 2023, 4:32am

Today I saw on the official NVIDIA website that the A40 description says its ultra-fast GDDR6 memory can be expanded to 96GB through NVLink. But I could not get 96GB in my tests, so I think this problem should still be solved here.

Robert_Crovella · November 20, 2023, 2:33pm

It’s not possible to transparently increase the memory size of a GPU using NVLink. That is not what is being communicated. There is no problem to solve as far as that goes. If you want to use the resources of multiple GPUs in pytorch, you’ll need to do that via pytorch.
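To illustrate what "via pytorch" involves, here is a minimal sketch of the planning step only: deciding which blocks of a model go on which GPU. It is plain Python and not PyTorch's or any library's actual algorithm; `plan_placement`, the 80-block model, and the 40 GiB per-GPU budget are all hypothetical. In PyTorch, each block would then be moved to its device with `.to("cuda:0")` or `.to("cuda:1")`, and the activations moved between devices in the forward pass.

```python
def plan_placement(layer_sizes_gib, budgets_gib):
    """Assign consecutive layers to devices, filling each device in order.

    Returns one device index per layer. Raises MemoryError if the layers
    do not fit within the given per-device budgets.
    """
    placement = []
    device = 0
    used = 0.0
    for size in layer_sizes_gib:
        # Move on to the next device once the current one cannot hold this layer.
        while device < len(budgets_gib) and used + size > budgets_gib[device]:
            device += 1
            used = 0.0
        if device == len(budgets_gib):
            raise MemoryError("model does not fit in the given budgets")
        placement.append(device)
        used += size
    return placement


# Hypothetical model: 80 equal blocks totalling 65 GiB of int8 weights.
layers = [65.0 / 80] * 80
# Hypothetical budget: 40 GiB per A40, leaving headroom for activations.
placement = plan_placement(layers, [40.0, 40.0])
print(placement.count(0), "blocks on GPU 0,", placement.count(1), "blocks on GPU 1")
```

With a single 40 GiB budget the same call raises MemoryError, which mirrors the out-of-memory failure seen when everything is loaded onto GPU 0.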
