Triton container keeps restarting

Hi, I am trying to run this script, but I run into the problem below:

My docker-compose.yml (I can’t find the GPU driver if I don’t rewrite docker-compose.yml):

# SPDX-License-Identifier: Apache-2.0

version: "3.8"
services:
  claratrain:
    container_name: claradevday-pt
    hostname: claratrain
    ##### use vanilla clara train docker
    #image: nvcr.io/nvidia/clara-train-sdk:v4.0
    ##### to build image with GPU dashboard inside jupyter lab
    build:
      context: ./dockerWGPUDashboardPlugin/    # Project root
      dockerfile: ./Dockerfile                 # Relative to context
    image: clara-train-nvdashboard:v4.0
    depends_on:
      - tritonserver
    ports:
      - "3030:8888"  # Jupyter lab port
      - "3031:5000"  # AIAA port
    ipc: host
    volumes:
      - ${TRAIN_DEV_DAY_ROOT}:/claraDevDay/
      - /raid/users/aharouni/data:/data/
    command: "jupyter lab /claraDevDay --ip 0.0.0.0 --allow-root --no-browser --config /claraDevDay/scripts/jupyter_notebook_config.py"
#    command: tail -f /dev/null
#    tty: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
              # To specify certain GPU uncomment line below
              #device_ids: ['0,3']
#############################################################
  tritonserver:
    image: nvcr.io/nvidia/tritonserver:21.02-py3
    container_name: aiaa-triton
    hostname: tritonserver
    restart: unless-stopped
    command: >
      sh -c "chmod 777 /triton_models &&
        /opt/tritonserver/bin/tritonserver \
          --model-store /triton_models \
          --model-control-mode="poll" \
          --repository-poll-secs=5 \
          --log-verbose ${TRITON_VERBOSE}"
    volumes:
      - ${TRAIN_DEV_DAY_ROOT}/AIAA/workspace/triton_models:/triton_models

user@userT:~/clara-train-examples/PyTorch/NoteBooks/scripts$ ./startClaraTrainNoteBooks.sh
user@userT:~/clara-train-examples/PyTorch/NoteBooks/scripts$ docker ps
CONTAINER ID   IMAGE                                   COMMAND                  CREATED          STATUS                          PORTS                                                                                                                             NAMES
04cbcc87349e   clara-train-nvdashboard:v4.0            "/usr/local/bin/nvid…"   45 minutes ago   Up 45 minutes                   0.0.0.0:6006->6006/tcp, :::6006->6006/tcp, 0.0.0.0:3031->5000/tcp, :::3031->5000/tcp, 0.0.0.0:3030->8888/tcp, :::3030->8888/tcp   claradevday-pt
c5ff214d5023   nvcr.io/nvidia/tritonserver:21.05-py3   "/opt/tritonserver/n…"   45 minutes ago   Restarting (1) 49 seconds ago                                                                                                                                     aiaa-triton

and the log:

user@userT:~/clara-train-examples/PyTorch/NoteBooks/scripts$ docker logs --tail 50 --follow --timestamps c5ff214d5023

2021-06-15T04:11:09.925816470Z Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved.
2021-06-15T04:11:09.925818637Z
2021-06-15T04:11:09.925820637Z Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
2021-06-15T04:11:09.925822724Z
2021-06-15T04:11:09.925824670Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2021-06-15T04:11:09.925827252Z By pulling and using the container, you accept the terms and conditions of this license:
2021-06-15T04:11:09.925834259Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2021-06-15T04:11:10.030918752Z
2021-06-15T04:11:10.030932899Z WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
2021-06-15T04:11:10.030935148Z    Use Docker with NVIDIA Container Toolkit to start this container; see
2021-06-15T04:11:10.030936767Z    https://github.com/NVIDIA/nvidia-docker.
2021-06-15T04:11:10.132848909Z ln: failed to create symbolic link '/opt/tritonserver/lib/libnvidia-ml.so.1': File exists

Actually, I can train a model with the GPU in this example.

Do you have any suggestions? Many thanks!

Hi
Thanks for your interest in the Clara Train SDK.
The Triton container seems to be restarting because it is hitting errors.

You might have an old docker-compose, or the NVIDIA runtime might not be set as the default for the Docker daemon.
Can you try running installDocker.sh, found in the scripts folder? It should detect any issues and upgrade docker-compose if needed.

Hi, you need to make sure this part is also under the tritonserver service.
Otherwise tritonserver cannot access the GPU.
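
Assuming "this part" means the GPU reservation block that the claratrain service already has in the compose file posted above, a minimal sketch of the tritonserver service with that block added might look like the following. The `deploy` section is copied from the claratrain service. As far as I know, docker-compose only honors it from version 1.28.0 onward, so treat this as a starting point rather than a confirmed fix.

```yaml
services:
  tritonserver:
    image: nvcr.io/nvidia/tritonserver:21.02-py3
    # container_name, hostname, restart, command and volumes stay as in the original file
    deploy:
      resources:
        reservations:
          devices:
            # Same GPU reservation as the claratrain service (assumed to be the missing part)
            - driver: nvidia
              capabilities: [ gpu ]
```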

Another thing I noticed in your docker ps output:
I see “nvcr.io/nvidia/tritonserver:21.05-py3”.
I think you should use 21.02 with Clara Train 4.0.

Thanks

Hi @aquraini , @yuantingh ,
Thank you for your replies.
My solution was to delete the whole clara-train-examples folder and download it again.
It actually worked normally before this error occurred, and I don’t know why it suddenly broke.