MistralAI models: Mistral-7B, Mistral-7B-Instruct, Mixtral-8x7B, Mixtral-8x7B-Instruct

I'm trying to load the Mixtral-8x7B-Instruct-v0.1 model (and the other models listed above) into Triton Inference Server to test different use cases such as chatbot, NER, and summarization.

My OS is Ubuntu 22.04.
My GPU is an RTX 4000 Ada Lovelace with 20 GB of VRAM.

In my /home/models/Mixtral-8x7B-Instruct-v0.1 folder I have this file structure (right after the tree I have also sketched, for comparison, the layout I believe Triton expects):
.
├── Dockerfile
├── README.md
├── __pycache__
│   └── model.cpython-310.pyc
├── buildimage.sh
├── config.json
├── config.pbtxt
├── consolidated.00.pt
├── consolidated.01.pt
├── consolidated.02.pt
├── consolidated.03.pt
├── consolidated.04.pt
├── consolidated.05.pt
├── consolidated.06.pt
├── consolidated.07.pt
├── generation_config.json
├── model-00001-of-00019.safetensors
├── model-00002-of-00019.safetensors
├── model-00003-of-00019.safetensors
├── model-00004-of-00019.safetensors
├── model-00005-of-00019.safetensors
├── model-00006-of-00019.safetensors
├── model-00007-of-00019.safetensors
├── model-00008-of-00019.safetensors
├── model-00009-of-00019.safetensors
├── model-00010-of-00019.safetensors
├── model-00011-of-00019.safetensors
├── model-00012-of-00019.safetensors
├── model-00013-of-00019.safetensors
├── model-00014-of-00019.safetensors
├── model-00015-of-00019.safetensors
├── model-00016-of-00019.safetensors
├── model-00017-of-00019.safetensors
├── model-00018-of-00019.safetensors
├── model-00019-of-00019.safetensors
├── model.safetensors.index.json
├── oldconfig.pbtxt
├── orgmodel.py
├── special_tokens_map.json
├── startserver.sh
├── testcurl.sh
├── testpayload.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
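
For comparison, this is the per-model layout I believe Triton expects inside a model repository (the file inside the version folder is just a placeholder for whichever backend is used, not a file I actually have):

/models
└── Mixtral-8x7B-Instruct-v0.1
    ├── config.pbtxt
    └── 1
        └── model.py (or model.savedmodel/, model.pt, ... depending on the backend)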

Back to my setup, the relevant files contain the following:

config.pbtxt
name: "Mixtral-8x7B-Instruct-v0.1.tensorflow"
backend: "tensorflow"
platform: "tensorflow_savedmodel"
max_batch_size: 1
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 1, 8, 22, 1 ]
  }
]
output [
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [ 1, 1 ]
  }
]
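
I am not sure the tensorflow backend is even the right choice for these checkpoints, since the weights are .pt and .safetensors files. If the Python backend is what I should be using instead, my understanding is that the config would look roughly like the sketch below (the I/O names and dims are placeholders I have not verified):

name: "Mixtral-8x7B-Instruct-v0.1"
backend: "python"
max_batch_size: 1
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]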

Dockerfile
FROM nvcr.io/nvidia/tritonserver:24.04-py3
USER root
WORKDIR /models
USER root

buildimage.sh
sudo DOCKER_BUILDKIT=1 docker buildx build . --tag mixtral-8x7b-instruct-v0.1:24.04-py3

startserver.sh
docker run --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/models/mixtral-8x7b-instruct-v0.1:/models mixtral-8x7b-instruct-v0.1:24.04-py3 tritonserver --model-repository=/models
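
For completeness: the three published ports are, as far as I know, Triton's defaults (8000 HTTP, 8001 gRPC, 8002 metrics). Once the server comes up, I would check readiness with something like:

curl -v localhost:8000/v2/health/ready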

When I start the server with ./startserver.sh, I get this log:

== Triton Inference Server ==

NVIDIA Release 24.04 (build 90085237)
Triton Server Version 2.45.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:

I0617 07:16:37.704109 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f1fcc000000' with size 268435456
I0617 07:16:37.704233 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
E0617 07:16:37.704654 1 model_repository_manager.cc:1335] Poll failed for model directory '1': Invalid model name: Could not determine backend for model '1' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
I0617 07:16:37.704678 1 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0617 07:16:37.704683 1 server.cc:634]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0617 07:16:37.704688 1 server.cc:677]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0617 07:16:37.749855 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA RTX 4000 SFF Ada Generation
I0617 07:16:37.751136 1 metrics.cc:770] Collecting CPU metrics
I0617 07:16:37.751234 1 tritonserver.cc:2538]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.45.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0617 07:16:37.751238 1 server.cc:307] Waiting for in-flight requests to complete.
I0617 07:16:37.751239 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0617 07:16:37.751244 1 server.cc:338] All models are stopped, unloading models
I0617 07:16:37.751245 1 server.cc:347] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Any help will be greatly appreciated.

Fabrizio, Rome, Italy