MistralAI models: Mistral-7B, Mistral-7B-Instruct, Mixtral-8x7B, Mixtral-8x7B-Instruct

I'm trying to load the Mixtral-8x7B-Instruct-v0.1 model (and the other models listed above) into Triton Inference Server to test different use cases such as chatbot, NER, and summarization.

My OS is Ubuntu 22.04.
My GPU is an RTX 4000 Ada Lovelace with 20 GB of VRAM.

In my /home/models/Mixtral-8x7B-Instruct-v0.1 folder I have this file structure (right after the tree I have also sketched, for comparison, the layout I believe Triton expects):
.
├── Dockerfile
├── README.md
├── __pycache__
│   └── model.cpython-310.pyc
├── buildimage.sh
├── config.json
├── config.pbtxt
├── consolidated.00.pt
├── consolidated.01.pt
├── consolidated.02.pt
├── consolidated.03.pt
├── consolidated.04.pt
├── consolidated.05.pt
├── consolidated.06.pt
├── consolidated.07.pt
├── generation_config.json
├── model-00001-of-00019.safetensors
├── model-00002-of-00019.safetensors
├── model-00003-of-00019.safetensors
├── model-00004-of-00019.safetensors
├── model-00005-of-00019.safetensors
├── model-00006-of-00019.safetensors
├── model-00007-of-00019.safetensors
├── model-00008-of-00019.safetensors
├── model-00009-of-00019.safetensors
├── model-00010-of-00019.safetensors
├── model-00011-of-00019.safetensors
├── model-00012-of-00019.safetensors
├── model-00013-of-00019.safetensors
├── model-00014-of-00019.safetensors
├── model-00015-of-00019.safetensors
├── model-00016-of-00019.safetensors
├── model-00017-of-00019.safetensors
├── model-00018-of-00019.safetensors
├── model-00019-of-00019.safetensors
├── model.safetensors.index.json
├── oldconfig.pbtxt
├── orgmodel.py
├── special_tokens_map.json
├── startserver.sh
├── testcurl.sh
├── testpayload.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
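
For comparison, this is the per-model layout I believe Triton expects inside a model repository (the file inside the version folder is just a placeholder for whichever backend is used, not a file I actually have):

/models
└── Mixtral-8x7B-Instruct-v0.1
    ├── config.pbtxt
    └── 1
        └── model.py (or model.savedmodel/, model.pt, ... depending on the backend)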

Back to my setup, the relevant files contain the following:

config.pbtxt
name: "Mixtral-8x7B-Instruct-v0.1.tensorflow"
backend: "tensorflow"
platform: "tensorflow_savedmodel"
max_batch_size: 1
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 1, 8, 22, 1 ]
  }
]
output [
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [ 1, 1 ]
  }
]
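
I am not sure the tensorflow backend is even the right choice for these checkpoints, since the weights are .pt and .safetensors files. If the Python backend is what I should be using instead, my understanding is that the config would look roughly like the sketch below (the I/O names and dims are placeholders I have not verified):

name: "Mixtral-8x7B-Instruct-v0.1"
backend: "python"
max_batch_size: 1
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]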

Dockerfile
FROM nvcr.io/nvidia/tritonserver:24.04-py3
USER root
WORKDIR /models
USER root

buildimage.sh
sudo DOCKER_BUILDKIT=1 docker buildx build . --tag mixtral-8x7b-instruct-v0.1:24.04-py3

startserver.sh
docker run --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/models/mixtral-8x7b-instruct-v0.1:/models mixtral-8x7b-instruct-v0.1:24.04-py3 tritonserver --model-repository=/models
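
For completeness: the three published ports are, as far as I know, Triton's defaults (8000 HTTP, 8001 gRPC, 8002 metrics). Once the server comes up, I would check readiness with something like:

curl -v localhost:8000/v2/health/ready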

When I start the server with ./startserver.sh, I get this log:

== Triton Inference Server ==

NVIDIA Release 24.04 (build 90085237)
Triton Server Version 2.45.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:

I0617 07:16:37.704109 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f1fcc000000' with size 268435456
I0617 07:16:37.704233 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
E0617 07:16:37.704654 1 model_repository_manager.cc:1335] Poll failed for model directory '1': Invalid model name: Could not determine backend for model '1' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
I0617 07:16:37.704678 1 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0617 07:16:37.704683 1 server.cc:634]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0617 07:16:37.704688 1 server.cc:677]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0617 07:16:37.749855 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA RTX 4000 SFF Ada Generation
I0617 07:16:37.751136 1 metrics.cc:770] Collecting CPU metrics
I0617 07:16:37.751234 1 tritonserver.cc:2538]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.45.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0617 07:16:37.751238 1 server.cc:307] Waiting for in-flight requests to complete.
I0617 07:16:37.751239 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0617 07:16:37.751244 1 server.cc:338] All models are stopped, unloading models
I0617 07:16:37.751245 1 server.cc:347] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Any help will be greatly appreciated.

Fabrizio, Rome, Italy