Hi all,
We have been running into issues with our HPC workloads. These are the typical errors we receive:
RuntimeError: No CUDA GPUs are available
RuntimeError: Cannot access accelerator device when none is available.
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.
In short, our workflow does training (in parallel) and inference. In total, a single workflow will try to access the GPUs on the host machine about 8 times during a run. We have successfully run about 30 workflows in a row without any issues, and then these errors suddenly pop up. The odd thing is that most of the time we are able to run our workloads without issue: the workflows are identical from one run to another, only the input data differs (and that does not affect the GPU access or the parallelization).
There are several components to this setup. We have been hesitant to post a thread here since it does not fully concern NVIDIA, but NVIDIA/CUDA is a crucial part of it and we are now running out of ideas.
To describe our setup:
We are running our workflows in Azure Batch. This means that the Tesla T4 VMs we use are provided by Azure and can be assumed to be fully functioning hardware.
We are running containerized workloads, meaning our scripts run from Docker images on the node. Docker therefore also needs to be able to talk to the GPUs. Our Docker images look like this:
FROM nvidia/cuda:12.6.0-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get install -y python3 python3-pip libglib2.0-0 libsm6 libxext6 libxrender-dev python3-opencv build-essential git gcc g++ && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /usr/app/ourapp
WORKDIR /usr/app
COPY ./requirements.txt ./
RUN python3 -m pip install --upgrade pip && pip install -r requirements.txt
COPY ./ourapp ./ourapp
ENV PYTHONPATH="/usr/app"
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["python3", "-u", "./ourapp/some_scripts.py"]
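For reference, inside a Batch task a container from this image ends up being started roughly like the command below. The image tag is a placeholder and the real invocation is generated by the Azure Batch task definition; the point is that every one of the roughly eight GPU accesses per workflow happens inside a container started with GPU access, so Docker must be able to hand the GPUs through.
# Approximation only: "ourapp:latest" is a placeholder tag and any extra
# arguments come from the Batch task. The GPUs are exposed to the container
# via --gpus all / the NVIDIA container runtime.
docker run --rm --gpus all ourapp:latest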
When running Azure Batch, the VMs need an OS image to run on. This OS image needs to have the relevant NVIDIA and CUDA drivers, as well as other toolkits and extensions. We previously used a standard OS image from Microsoft, microsoft-dsvm ubuntu-hpc 2204 (latest), which comes with all the necessary drivers and toolkits for running on GPUs pre-installed. As promising as that sounds, this image has also failed for us, giving error messages of the same nature as described above. This led us to create our own custom OS image. A custom OS image is essentially a pre-existing image, such as microsoft-dsvm ubuntu-hpc 2204 (latest), with our own modifications applied. We took the pre-existing image canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 (latest), which has very little pre-installed, and made the following modifications:
- First, install Docker Engine (we run our workloads containerized):
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world # to confirm that it works
- CUDA toolkit and NVIDIA drivers:
sudo apt install linux-headers-$(uname -r)
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9
sudo apt-get install -y nvidia-open
sudo apt -V install libnvidia-compute-575 nvidia-dkms-575-open
sudo apt update
sudo apt install -y nvidia-gds
sudo reboot
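After the reboot, the driver can be sanity-checked directly on the host, before touching Docker at all, for example:
# Confirm the NVIDIA kernel module is loaded and that nvidia-smi reports the
# expected driver/CUDA versions and sees the T4.
nvidia-smi
lsmod | grep nvidia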
- NVIDIA Container Toolkit for Docker:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Once the modifications are done, we confirm that the setup works by running:
sudo apt install python3.10-venv
python3 -m venv venv
source venv/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
python3
import torch
print(torch.cuda.is_available())  # prints True
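The same check can also be run non-interactively from the shell (the extra fields printed here are just a suggestion):
# One-liner variant: prints CUDA availability, the number of visible devices,
# and the CUDA version PyTorch was built against.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count(), torch.version.cuda)"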
When all the modifications are done, we capture the VM's state (with all the modifications) and use that state as the OS image for all future VMs that run our workloads.
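For completeness, the capture step corresponds to roughly the following Azure CLI commands. The resource group, VM and image names are placeholders, and this is only an illustration of the capture step, not our exact pipeline (which may instead go through an Azure Compute Gallery).
# Placeholders: myResourceGroup, myBuildVm, myCustomImage.
# Deallocate and generalize the build VM, then capture it as a managed image
# that the Batch pool can reference.
az vm deallocate --resource-group myResourceGroup --name myBuildVm
az vm generalize --resource-group myResourceGroup --name myBuildVm
az image create --resource-group myResourceGroup --name myCustomImage --source myBuildVm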
As previously stated, we have been able to run several workloads successfully before it suddenly fails. We truly hope there are some NVIDIA/CUDA experts who can help us with this. It would be truly appreciated.
Cheers,
Tov