Hello,
I am currently working with two DGX A100 servers and want to run multi-node training. To give both servers identical, up-to-date environments, I tried running nvidia-release-upgrade. However, both servers return the following message when I try to upgrade, even though one server has a lower DGX_SWBUILD_VERSION than the other.
Could you please explain why this is happening?
jsh@pnode14:~/workspaces/runtimes$ sudo nvidia-release-upgrade
Adding component(s) 'restricted' to all repositories.
Press [ENTER] to continue or Ctrl-c to cancel.
Hit:1 https://nvidia.github.io/libnvidia-container/stable/deb/amd64 InRelease
Hit:2 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
Hit:3 http://us.archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease
Hit:6 http://us.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Fetched 129 kB in 2s (68.5 kB/s)
Reading package lists... Done
Package: *
Pin: release l=Base OS Repository
Pin-Priority: 600
Package: *
Pin: release l=Base OS Updates Repository
Pin-Priority: 600
Package: *
Pin: release l=NVIDIA CUDA
Pin-Priority: 580
Package: cuda-drivers
Pin: release l=NVIDIA CUDA
Pin-Priority: -1
Package: nsight-compute
Pin: origin *ubuntu.com*
Pin-Priority: -1
Package: nsight-systems
Pin: origin *ubuntu.com*
Pin-Priority: -1
Package: nvidia-fabricmanager-*
Pin: origin *ubuntu.com*
Pin-Priority: 600
Package: libnvidia-nscq-*
Pin: origin *ubuntu.com*
Pin-Priority: 600
Checking for a new Ubuntu release
There is no development version of an LTS available.
To upgrade to the latest non-LTS development release
set Prompt=normal in /etc/update-manager/release-upgrades.
I have attached the contents of /etc/dgx-release from both servers for your reference.
Thank you in advance for your assistance!
jsh@pnode4:~/workspaces/runtimes$ cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2023-05-16-16-18-31"
DGX_SWBUILD_VERSION="6.0.11"
DGX_COMMIT_ID="d0b730d"
DGX_PLATFORM="DGX Server for DGX A100"
DGX_SERIAL_NUMBER=
jsh@pnode14:~/workspaces/runtimes$ cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2022-10-11-17-49-32"
DGX_SWBUILD_VERSION="5.4.1"
DGX_COMMIT_ID="38d36e8"
DGX_PLATFORM="DGX Server for DGX A100"
DGX_SERIAL_NUMBER=
DGX_OTA_VERSION="5.6.0"
DGX_OTA_DATE="2024. 07. 26. (금) 17:07:03 KST"
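
To make the comparison between the two hosts easier to repeat, here is a minimal sketch that prints one version string per host from a dgx-release style file. It rests on an assumption I have not confirmed: that DGX_OTA_VERSION, when present, records a later over-the-air update and therefore supersedes DGX_SWBUILD_VERSION. The function name dgx_effective_version is my own and is not an NVIDIA tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DGX OS): print the version recorded in a
# dgx-release style file.
# Assumption: DGX_OTA_VERSION, if present, supersedes DGX_SWBUILD_VERSION.
dgx_effective_version() {
    local file="${1:-/etc/dgx-release}"
    local swbuild ota
    # Take the last matching line in case a key appears more than once.
    swbuild=$(sed -n 's/^DGX_SWBUILD_VERSION="\(.*\)"$/\1/p' "$file" | tail -n 1)
    ota=$(sed -n 's/^DGX_OTA_VERSION="\(.*\)"$/\1/p' "$file" | tail -n 1)
    # Fall back to the build version when no OTA version is recorded.
    echo "${ota:-$swbuild}"
}
```

Running dgx_effective_version on each node with the files shown above would report 6.0.11 for pnode4 and 5.6.0 for pnode14, so under that assumption pnode14 is still on the older release.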