"PTX was compiled with an unsupported toolchain" error when running dlib on Google Kubernetes Engine with CUDA

I am trying to run dlib for face detection on Google Kubernetes Engine, but I keep running into the following error:

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
RuntimeError: Error while calling cudaMallocHost(&data, new_size*sizeof(float)) in file /dlib/dlib/cuda/gpu_data.cpp:211. code: 222, reason: the provided PTX was compiled with an unsupported toolchain.

This suggests a mismatch between the driver and the compilation toolchain. However, I am reasonably certain that the two are compatible. The Google Kubernetes Engine pod is running an NVIDIA Tesla T4 GPU with an R470 driver, which I verified by checking the pod itself (via SSH into the cluster):

root@worker:/usr/app# nvidia-smi
Sat Nov 11 18:17:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@resize-workers-statefulset-0:/usr/app#
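
The two numbers that matter in this output are the driver version and the CUDA version the driver advertises. As a minimal sketch (the helper name `parse_smi_header` is hypothetical and not part of any NVIDIA tool), they can be pulled out of the header line like this:

```python
import re

# Header line copied from the nvidia-smi output above.
HEADER = "| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |"


def parse_smi_header(line):
    """Return (driver_version, cuda_version) as tuples of ints, or None if not found.

    Hypothetical helper for illustration; it only understands the header format shown above.
    """
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line)
    if match is None:
        return None
    driver, cuda = match.groups()
    return (
        tuple(int(part) for part in driver.split(".")),
        tuple(int(part) for part in cuda.split(".")),
    )


driver_version, driver_cuda = parse_smi_header(HEADER)
print("driver:", driver_version, "advertised CUDA:", driver_cuda)
```

Storing the versions as integer tuples makes later comparisons straightforward, since Python compares tuples element by element.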

To compile and run dlib, I am using an official NVIDIA Docker image with CUDA 11.8. According to NVIDIA’s documentation and the CUDA 12.3 Release Notes, CUDA 11.8 is compatible with the 470.182.03 driver, since that driver version is newer than the required 450.80.02.
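
The comparison behind that claim can be written out as a short sketch. The function names are hypothetical, and the check covers only the minimum-driver threshold quoted above; passing it does not by itself show that every code path in a given library is supported.

```python
def parse_version(text):
    """Turn a dotted version string such as '470.182.03' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(installed, minimum):
    """Hypothetical helper: True if the installed driver is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Driver version from nvidia-smi and the threshold quoted from the release notes.
print(driver_meets_minimum("470.182.03", "450.80.02"))
```

Comparing integer tuples avoids the pitfall of comparing version strings lexicographically, where "470.9" would sort after "470.182".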

I further verified this with a super simple test Dockerfile:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

COPY simple_cuda_test.cu /simple_cuda_test.cu
RUN nvcc -o simple_cuda_test /simple_cuda_test.cu

CMD ["./simple_cuda_test"]

where the simple_cuda_test.cu file is as follows:

#include <stdio.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main() {
    int c;
    int *dev_c;

    // Allocate memory on the GPU
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Launch the add() kernel on the GPU
    add<<<1,1>>>(2, 7, dev_c);

    // Copy the result back to the host
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("2 + 7 = %d\n", c);

    // Cleanup
    cudaFree(dev_c);

    return 0;
}

This yields the following output:

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

2 + 7 = 1

I then created the following Dockerfile to test dlib’s cnn_face_detection_model_v1 model:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

# dependencies
RUN apt-get update && \
    apt-get install -y \
    --no-install-recommends --no-install-suggests \
    gcc-11 g++-11 \
    git \
    build-essential \
    cmake \
    libboost-all-dev \
    libopenblas-dev \
    liblapack-dev \
    libavdevice-dev \
    libavfilter-dev \
    libavformat-dev \
    libavcodec-dev \
    libswresample-dev \
    libswscale-dev \
    libavutil-dev \
    python3 \
    python3-venv \
    python3-dev \
    python3-distutils \
    python3-pip \
    libmagic1 \
    pkg-config && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# install dlib
RUN git clone https://github.com/davisking/dlib.git /dlib && \
    cd /dlib && \
    python3 setup.py install --clean

# PATH gets the CUDA bin directory; lib64 belongs in LD_LIBRARY_PATH
ENV PYTHONPATH=/usr/app \
    PATH="/usr/local/cuda-11.8/bin:$PATH" \
    CUDA_HOME="/usr/local/cuda-11.8" \
    LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"

# simple test files for dlib
COPY mmod_human_face_detector.dat mmod_human_face_detector.dat
COPY test_dlib.py test_dlib.py
COPY test_image.jpg test_image.jpg

CMD ["python3", "test_dlib.py"]

where the test_dlib.py file is as follows:

import dlib
import time

print("dlib version: {}".format(dlib.__version__))

# Check if Dlib was compiled with CUDA support
if dlib.DLIB_USE_CUDA:
    print("Dlib was compiled with CUDA support.")
else:
    print("Dlib was NOT compiled with CUDA support.")

# Check if CUDA is currently available
if dlib.cuda.get_num_devices() > 0:
    print("CUDA is available. Number of CUDA devices:", dlib.cuda.get_num_devices())
else:
    print("CUDA is not available.")

detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

# Load the image
image_path = "test_image.jpg"
image = dlib.load_rgb_image(image_path)

start = time.time()
dets = detector(image, 1)
end = time.time()
print("detection time: {}".format(end - start))

print("Number of faces detected: {}".format(len(dets)))

Running the image built from this Dockerfile on the pod yields the following output:

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

dlib version: 19.24.99
Dlib was compiled with CUDA support.
CUDA is available. Number of CUDA devices: 1
Traceback (most recent call last):
  File "//test_dlib.py", line 18, in <module>
    detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
RuntimeError: Error while calling cudaMallocHost(&data, new_size*sizeof(float)) in file /dlib/dlib/cuda/gpu_data.cpp:211. code: 222, reason: the provided PTX was compiled with an unsupported toolchain.

Any ideas on what the issue could be?

(As a side note, I strongly prefer using CUDA 11.8. I’ve tried downgrading to CUDA 11.4, but this introduces a host of other dependency issues and complications with the Python application I’m running.)

In practice, this error almost always means that you should update your GPU driver to the latest one available for your GPU.

Attempting to use a driver that advertises CUDA 11.4 support with a CUDA 11.8 toolchain is another good indication that you should update your GPU driver to the latest one available for your GPU.

I won’t be able to comment on whether your R470 driver should work with CUDA 11.8. Do as you wish, of course.
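
The mismatch pointed out in this reply can be expressed as a simple comparison. The sketch below rests on an assumption, not a confirmed diagnosis of this case: PTX generated by a newer toolkit can only be JIT-compiled by a driver whose advertised CUDA version is at least as new as that toolkit. The function names are hypothetical.

```python
def parse_major_minor(text):
    """Keep only the major and minor parts, so '11.8.0' and '11.8' compare equal."""
    return tuple(int(part) for part in text.split(".")[:2])


def driver_can_jit_ptx(driver_cuda, toolkit_cuda):
    """Hypothetical helper: True if the driver's advertised CUDA version is at
    least the toolkit's version. Assumes PTX from a newer toolkit is rejected
    by an older driver."""
    return parse_major_minor(driver_cuda) >= parse_major_minor(toolkit_cuda)


# Values from this thread: the driver advertises CUDA 11.4, the image ships CUDA 11.8.0.
print(driver_can_jit_ptx("11.4", "11.8.0"))
# After a driver update that advertises CUDA 11.8 or newer, the check would pass.
print(driver_can_jit_ptx("11.8", "11.8.0"))
```

Under that assumption, the first call reports a mismatch for the versions in this thread, which is consistent with the advice to update the driver.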