Failing to detect GPUs on Tesla T4s

Hi all,

We have been running into issues when running HPC workloads. These are the typical errors we receive:
RuntimeError: No CUDA GPUs are available

RuntimeError: Cannot access accelerator device when none is available.

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
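
When these errors appear, a quick sanity check on the node along these lines (just a sketch; it assumes nvidia-smi is on the PATH and that PyTorch is installed in the environment the workload uses) shows whether the driver can still see the card at all:

nvidia-smi   # does the driver/NVML still see the T4?
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"   # does PyTorch see it?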

In short, our workflow runs training and inference in parallel. In total, a single workflow will try to access the GPUs on the host machine about 8 times during a run. We have successfully run about 30 workflows without any issues, and then suddenly these errors pop up. Now, the odd thing is that most of the time we are able to run our workloads without issue. The workflows do not differ from one run to another, only the input data (which does not affect GPU access or the parallelization).

There are several components to this. We have been hesitant to post a thread here since the overall setup does not fully concern NVIDIA, but NVIDIA/CUDA is a crucial part of it and we are now running out of ideas.

To describe our setup:

We are running our workflows in Azure Batch. This means the Tesla T4 VMs we use are provided by Azure and are assumed to be fully functioning hardware.

We are running containerized workloads, meaning our scripts run from Docker images on the node. This also involves Docker, as it needs to be able to talk to the GPUs. Our Docker image looks like this:

# CUDA runtime base image (no compiler toolchain needed inside the container)
FROM nvidia/cuda:12.6.0-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive

# System dependencies for Python, OpenCV and building any wheels from source
RUN apt-get update && \
    apt-get install -y python3 python3-pip libglib2.0-0 libsm6 libxext6 libxrender-dev python3-opencv build-essential git gcc g++ && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /usr/app/ourapp
WORKDIR /usr/app

# Install Python dependencies first so this layer is cached between code changes
COPY ./requirements.txt ./
RUN python3 -m pip install --upgrade pip && pip install -r requirements.txt

# Application code
COPY ./ourapp ./ourapp

ENV PYTHONPATH="/usr/app"
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["python3", "-u", "./ourapp/some_scripts.py"]
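
For reference, running this image by hand on a node would look roughly like the following (the registry/image name is a placeholder; in Azure Batch the GPU flag would instead go into the task's container run options):

sudo docker run --rm --gpus all ourregistry.azurecr.io/ourapp:latest   # ENTRYPOINT then runs ./ourapp/some_scripts.py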

When running Azure Batch, the VMs need an OS image to run. This OS image needs to have the relevant NVIDIA and CUDA drivers, as well as other toolkits and extensions. We previously used a standard OS image from Microsoft, microsoft-dsvm ubuntu-hpc 2204 (latest), which comes with all the necessary drivers and toolkits pre-installed for running on GPUs. As promising as that sounds, the microsoft-dsvm ubuntu-hpc 2204 (latest) image has also failed for us, giving error messages of the same nature as described above.

This led us to create our own custom OS image. A custom OS image essentially takes a pre-existing image, such as microsoft-dsvm ubuntu-hpc 2204 (latest), and adds modifications on top. We took the pre-existing image canonical 0001-com-ubuntu-server-jammy 22_04-lts-gen2 (latest), which has few modifications on it, and modified it as follows:

  • First, install Docker Engine (we run our workloads containerized):
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world   # to confirm that it works
  • CUDA and NVIDIA drivers:
sudo apt install linux-headers-$(uname -r)
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9
sudo apt-get install -y nvidia-open
sudo apt -V install libnvidia-compute-575 nvidia-dkms-575-open
sudo apt update
sudo apt install -y nvidia-gds
sudo reboot
  • NVIDIA Container Toolkit for Docker:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
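
Before the PyTorch check below, it can also help to confirm the host side directly (a sketch; the exact version strings depend on which driver packages ended up installed):

nvidia-smi                         # driver loaded and the Tesla T4 visible on the host
cat /proc/driver/nvidia/version    # driver / kernel module version string
docker info | grep -i runtime      # "nvidia" should appear among Docker's runtimes (nvidia-ctk writes it to /etc/docker/daemon.json)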

Once the modifications are done, we confirm that the full setup works by running:

sudo apt install python3.10-venv
python3 -m venv venv
source venv/bin/activate
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
python3 -c "import torch; print(torch.cuda.is_available())"   # prints True
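
When in doubt, printing the CUDA version PyTorch was built with and the visible device name is another quick sanity check (sketch):

python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_name(0))"   # e.g. 12.6 Tesla T4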

When all the modifications are done, we capture the VM’s state (with all the modifications) and use this state as the OS image for all future VMs that run our workloads.

As previously stated, we have been able to run several workloads successfully, and then it suddenly fails. We truly hope there are some NVIDIA/CUDA experts who can help us with this. It would be truly appreciated.

Cheers,
Tov

Hi Tov,

Unfortunately, I don’t think we’ll be able to help you here. This forum is for the NVHPC compilers (nvc, nvc++, and nvfortran), so this is outside our expertise.

Also, I’m not entirely sure where to send you. From what I’m reading, it sounds more like a system issue, like the job isn’t getting scheduled on a system with a T4, or the GPU isn’t getting properly set up for some of the runs. Have you talked with anyone from Azure? That seems like the best place to start.

-Mat

Hi Mat,

Thanks for replying so quickly.

My apologies if this was the incorrect forum to ask; I do not know all the nuances when it comes to NVIDIA.

We have been in contact with Azure Support several times and they have provided good help, but to no avail, other than leading us to assume that the OS base image is at fault. Azure Support has claimed that everything in the Azure Batch service (and all it provides) works as intended and that this is a client error (our fault). Azure Support has tried to help us with the setup, but our demands and needs have not been met satisfactorily, partly because technologies that are not Azure’s responsibility, such as Docker, NVIDIA drivers, and CUDA, are in play.

To make it more relevant for this forum:
If we forget about Azure Batch for a minute, could you provide any input on the scripts we run when installing drivers and toolkits for compatibility with Tesla T4 GPUs?

If we forget about Azure Batch for a minute, could you provide any input on the scripts we run when installing drivers and toolkits for compatibility with Tesla T4 GPUs?

Well, if you had a question on how to do GPU offloading in a Fortran program, I’m your guy, but for this, sorry, no idea. I’m asking around my team to see if anyone else can help.

There is an NGC-Cloud forum with an Azure sub-forum which seems a better fit for this question, but I hesitate to send you there since it doesn’t appear to be moderated and very few of the questions have responses.

I asked the folks who build the NVHPC containers, but they’re not sure either.

Though they did have a question. When you say:

When all the modifications are done, we capture the VM’s state (with all the modifications) and use this state as the OS image for all future VMs that run our workloads.
As previously stated, we have been able to run several workloads successfully, and then it suddenly fails.

Are you saving the state of the VM and then restoring it on a separate VM on another system?

This seems problematic, and they expect that you’d need to restart the VM after restoring it. Otherwise the systems may be different and you could encounter issues.

Though they may not be clear on what you mean by this statement.

I might try to send the CUDA and NVIDIA installation process to this CUDA forum, to see if anything happens.

Are you saving the state of the VM and then restoring it on a separate VM on another system?

To first clarify: we boot up a VM with a base Ubuntu OS and install the toolkits and packages we need. Then, while the VM is running, Azure has support for capturing the current OS state of the VM. We hand what is captured to the Azure Batch service, which uses it as the system for other VMs. Whatever Azure Batch then does with the image or the VM is difficult to say.

This is a great question we have been asking ourselves. We tell the Azure Batch service that any VMs it spins up should use this OS image we created. But what does that actually mean? Does Azure restart the VM after restoring it, or do they perform any updates that could alter the state in a bad way? Are they actually making sure that the OS state matches the image we tell it to use? It looks like the state changes, or can change, for each VM, given the irregular occurrence of errors and successes.

They have not given us any good information about what really happens after spinning up a VM. Do you think that we should reboot the node explicitly?
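
One option we have been toying with is a start task that checks GPU visibility before any workload runs, and fails fast or forces a reboot when the GPU is not there. Purely a sketch (the retry count and the reboot step are our own guesses, not anything Azure recommends):

#!/bin/bash
# Hypothetical pre-workload GPU health check for the node.
for attempt in 1 2 3; do
    if nvidia-smi > /dev/null 2>&1; then
        echo "GPU visible, node looks healthy"
        exit 0
    fi
    echo "GPU not visible (attempt $attempt), waiting..."
    sleep 10
done
echo "GPU still not visible, rebooting node"
sudo reboot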

Wish we could help more here, but it’s very far outside our area of expertise.

You might have better luck on the CUDA installation forum, but your issue is very specific so I don’t know. Worth a try.