I bought one NVLink device to connect two NVIDIA A40 graphics cards on an Ubuntu 20.04 system. I downloaded and installed the driver nvidia-driver-local-repo-ubuntu2004-515.105.01_1.0-1_amd64.deb from the official website, and then installed CUDA 11.8 (cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb), also from the official website. After installing nvidia-fabricmanager-520_520.61.05-1_amd64.deb and nvidia-fabricmanager-dev-520_520.61.05-1_amd64.deb, I started the fabricmanager service
(sudo systemctl start nvidia-fabricmanager), and the following error was reported: NV_WARN_NOTHING_TO_DO
Please help me to fix this error and start the nvidia-fabricmanager service correctly, thanks!
NVlinkError-fabricmanager-en1.docx (315.6 KB)
Fabric manager is not needed and should not be installed on a system that uses NVLink bridges. It is intended for use with systems that have NVSwitch functionality. A system with A40 GPUs, regardless of configuration or anything you do, does not have or use NVSwitches.
Thank you! But when I run a 70B LLM in int8, a program that needs about 70 GB of video memory, the system loads about 46 GB onto one A40 graphics card, the other A40 is never used, and an error is reported, as follows:
torch.empty(self.output_size,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 0; 44.42 GiB total capacity; 43.11 GiB already allocated; 426.81 MiB free; 43.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Please help me!
NVLink (or fabric manager) does not mean that when you attempt to use one GPU it will automatically spill over to the other GPU, nor does it mean that your A40 has twice the memory.
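To put rough numbers on that, here is a minimal sketch in plain Python (no GPU needed). It takes the 70 GB figure and the 44.42 GiB capacity from the posts above. The layer count of 80, the equal layer sizes, and the 40 GiB per-GPU budget are assumptions made only for illustration, and `fits_on_one_gpu` and `split_layers` are hypothetical helper names, not part of any library.

```python
GIB = 1024 ** 3

def fits_on_one_gpu(model_bytes, gpu_capacity_bytes):
    """True only if the whole model fits in a single GPU's memory."""
    return model_bytes <= gpu_capacity_bytes

def split_layers(num_layers, layer_bytes, budgets):
    """Greedily assign consecutive layers to devices, never exceeding a budget.

    Returns a list where entry i is the device index chosen for layer i.
    """
    plan = []
    device = 0
    used = 0
    for _ in range(num_layers):
        # Move to the next device once the current one cannot hold this layer.
        while device < len(budgets) and used + layer_bytes > budgets[device]:
            device += 1
            used = 0
        if device == len(budgets):
            raise MemoryError("model does not fit in the combined budgets")
        plan.append(device)
        used += layer_bytes
    return plan

model_bytes = 70 * 10**9            # "70G" from the question, read as 70 GB
gpu_capacity = int(44.42 * GIB)     # total capacity reported in the error message
num_layers = 80                     # assumption: 80 equally sized layers
layer_bytes = model_bytes // num_layers
budget = 40 * GIB                   # assumption: leave headroom for activations

single_gpu_ok = fits_on_one_gpu(model_bytes, gpu_capacity)
plan = split_layers(num_layers, layer_bytes, [budget, budget])

print("fits on one A40:", single_gpu_ok)
print("layers on GPU 0:", plan.count(0), "layers on GPU 1:", plan.count(1))
```

Under these assumptions the model cannot fit on one card, but it does fit once the layers are explicitly divided between the two cards. That division is something the framework or your code has to do; the NVLink bridge only speeds up the transfers between the GPUs.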
Your question is fundamentally about how to use pytorch with multiple GPUs. I suggest asking on pytorch forum (discuss.pytorch.org). This is not the right forum for that, and I won’t be able to help further with that.
Thank you! I then asked this question at this link: https://discuss.pytorch.org/t/2-nvidia-a40-but-show-error-torch-cuda-outofmemory-error/192037
Today I saw on the official NVIDIA website that the A40 page says its ultra-fast GDDR6 memory can be expanded to 96 GB through NVLink. But I couldn’t reach 96 GB in my tests, so I think this problem should still be solved here.
It’s not possible to transparently increase the memory size of a GPU using NVLink. That is not what is being communicated. There is no problem to solve as far as that goes. If you want to use the resources of multiple GPUs in pytorch, you’ll need to do that via pytorch.
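As a rough illustration of what "via pytorch" can mean, here is a minimal sketch that places the two halves of a toy model on different devices and moves the activations between them by hand. This is not the 70B model from the question: `TwoStageModel`, `pick_devices`, and the layer width of 64 are hypothetical names and values chosen for the example, and the code falls back to the CPU when fewer than two GPUs are visible, so it only demonstrates the placement pattern.

```python
import torch
import torch.nn as nn

def pick_devices():
    """Use cuda:0 and cuda:1 when two GPUs are visible, otherwise the CPU."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    cpu = torch.device("cpu")
    return cpu, cpu

class TwoStageModel(nn.Module):
    """Toy model whose first stage lives on dev0 and second stage on dev1."""

    def __init__(self, dev0, dev1, width=64):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # Each stage's weights are allocated on its own device.
        self.stage0 = nn.Sequential(nn.Linear(width, width), nn.ReLU()).to(dev0)
        self.stage1 = nn.Linear(width, width).to(dev1)

    def forward(self, x):
        x = self.stage0(x.to(self.dev0))
        # Explicit copy of the activations to the second device.
        return self.stage1(x.to(self.dev1))

dev0, dev1 = pick_devices()
model = TwoStageModel(dev0, dev1)
out = model(torch.randn(8, 64))
print(out.shape, out.device)
```

With two GPUs present, each card holds only its own stage's weights, which is how a model larger than one card's memory can be spread across both. The copy between devices in `forward` is the kind of transfer an NVLink bridge can speed up.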