UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Hello everyone,

I’m trying to load and run the mistral-7b-instruct-v0.3 NIM on my NVIDIA GeForce RTX 4090 Laptop GPU (16 GB GDDR6) in a Docker container on a Windows 11 Pro laptop.

Issue
When I attempt to initialize the model, I receive the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Below is the full console output, including the error trace, from the initialization attempt:

C:\Windows\System32>docker run -it --rm --gpus all --shm-size=16GB -e NGC_API_KEY -v "%LOCAL_NIM_CACHE%:/opt/nim/.cache" -p 8000:8000 nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.1.2
Model: mistralai/mistral-7b-instruct-v0.3

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License here: NVIDIA Agreements | Enterprise Software | NVIDIA AI Foundation Models Community License Agreement.

ADDITIONAL INFORMATION: Apache 2.0 License (Apache License, Version 2.0).

INFO 11-12 04:02:08.894 ngc_profile.py:222] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 11-12 04:02:08.894 ngc_profile.py:224] Detected 1 compatible profile(s).
INFO 11-12 04:02:08.894 ngc_injector.py:132] Valid profile: 0912dfabc816a9819b209b4ab0ba63b3dcd92e33ea1d14eb161cf9f12c83e626 (vllm-bf16-tp1) on GPUs [0]
INFO 11-12 04:02:08.894 ngc_injector.py:190] Selected profile: 0912dfabc816a9819b209b4ab0ba63b3dcd92e33ea1d14eb161cf9f12c83e626 (vllm-bf16-tp1)
INFO 11-12 04:02:08.901 ngc_injector.py:198] Profile metadata: feat_lora: false
INFO 11-12 04:02:08.902 ngc_injector.py:198] Profile metadata: llm_engine: vllm
INFO 11-12 04:02:08.902 ngc_injector.py:198] Profile metadata: precision: bf16
INFO 11-12 04:02:08.902 ngc_injector.py:198] Profile metadata: tp: 1
INFO 11-12 04:02:08.902 ngc_injector.py:218] Preparing model workspace. This step might download additional files to run the model.
INFO 11-12 04:02:09.128 ngc_injector.py:233] Model workspace is now ready. It took 0.226 seconds
INFO 11-12 04:02:09.144 launch.py:46] engine_world_size=1
INFO 11-12 04:02:09.144 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3', '-m', 'vllm_nvext.entrypoints.openai.api_server', '--served-model-name', 'mistralai/mistral-7b-instruct-v0.3', '--async-engine-args', '{"model": "/tmp/mistralai--mistral-7b-instruct-v0.3-erehbdyf", "served_model_name": ["mistralai/mistral-7b-instruct-v0.3"], "tokenizer": "/tmp/mistralai--mistral-7b-instruct-v0.3-erehbdyf", "skip_tokenizer_init": false, "tokenizer_mode": "auto", "trust_remote_code": false, "download_dir": null, "load_format": "auto", "dtype": "auto", "kv_cache_dtype": "auto", "quantization_param_path": null, "seed": 0, "max_model_len": null, "worker_use_ray": false, "distributed_executor_backend": "ray", "pipeline_parallel_size": 1, "tensor_parallel_size": 1, "max_parallel_loading_workers": null, "block_size": 16, "enable_prefix_caching": false, "disable_sliding_window": false, "use_v2_block_manager": false, "swap_space": 4, "cpu_offload_gb": 0, "gpu_memory_utilization": 0.9, "max_num_batched_tokens": null, "max_num_seqs": 256, "max_logprobs": 20, "disable_log_stats": false, "revision": null, "code_revision": null, "rope_scaling": null, "rope_theta": null, "tokenizer_revision": null, "quantization": null, "enforce_eager": false, "max_context_len_to_capture": null, "max_seq_len_to_capture": 8192, "disable_custom_all_reduce": false, "tokenizer_pool_size": 0, "tokenizer_pool_type": "ray", "tokenizer_pool_extra_config": null, "enable_lora": false, "max_loras": 8, "max_lora_rank": 32, "enable_prompt_adapter": false, "max_prompt_adapters": 1, "max_prompt_adapter_token": 0, "fully_sharded_loras": false, "lora_extra_vocab_size": 256, "long_lora_scaling_factors": null, "lora_dtype": "auto", "max_cpu_loras": 16, "peft_source": null, "peft_refresh_interval": null, "device": "auto", "ray_workers_use_nsight": false, "num_gpu_blocks_override": null, "num_lookahead_slots": 0, "model_loader_extra_config": null, "ignore_patterns": [], "preemption_mode": null, "scheduler_delay_factor": 0.0, "enable_chunked_prefill": null, "guided_decoding_backend": "lm-format-enforcer", "speculative_model": null, "speculative_draft_tensor_parallel_size": null, "num_speculative_tokens": null, "speculative_max_model_len": null, "speculative_disable_by_batch_size": null, "ngram_prompt_lookup_max": null, "ngram_prompt_lookup_min": null, "spec_decoding_acceptance_method": "rejection_sampler", "typical_acceptance_sampler_posterior_threshold": null, "typical_acceptance_sampler_posterior_alpha": null, "qlora_adapter_name_or_path": null, "disable_logprobs_during_spec_decoding": null, "otlp_traces_endpoint": null, "engine_use_ray": false, "disable_log_requests": true, "selected_gpus": [{"name": "NVIDIA GeForce RTX 4090 Laptop GPU", "device_index": 0, "device_id": "2757:10de", "total_memory": 17171480576, "free_memory": 16829644800, "used_memory": 0, "reserved_memory": 341835776, "family": null}]}']
[1731384132.457883] [804af0c89968:49 :0] parser.c:2305 UCX WARN unused environment variables: UCX_HOME; UCX_DIR (maybe: UCX_TLS?)
[1731384132.457883] [804af0c89968:49 :0] parser.c:2305 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-11-12 04:02:15,569 [INFO] PyTorch version 2.3.1 available.
2024-11-12 04:02:20,257 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-11-12 04:02:20,257 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-11-12 04:02:20,282 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.1.dev20240721
INFO 11-12 04:02:20.325 api_server.py:625] NIM LLM API version 1.1.2
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 703, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 463, in from_engine_args
    executor_class = cls._get_executor_cls(engine_config)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 435, in _get_executor_cls
    initialize_ray_cluster(engine_config.parallel_config)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/executor/ray_utils.py", line 90, in initialize_ray_cluster
    ray.init(address=ray_address, ignore_reinit_error=True)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/node.py", line 571, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
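
For what it's worth, 0xf8 can never appear as the first byte of a valid UTF-8 sequence, so the device name Ray reads back from NVML seems to be corrupted before the decode even runs. As a sketch (assuming pynvml is importable inside the container's Python environment), the following reproduces the query Ray performs and prints the raw bytes NVML reports for the GPU name:

import pynvml

# Same NVML query Ray makes before its utf-8 decode
pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    name = pynvml.nvmlDeviceGetName(handle)  # bytes on older pynvml releases, str on newer ones
    print(repr(name))
finally:
    pynvml.nvmlShutdown()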

Any guidance or suggestions to address this decode error would be greatly appreciated!
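
If it helps with diagnosis, one check I can run from the Windows side is what the NVML stack exposed to containers reports for the GPU name (assuming, as is usual with Docker Desktop's WSL2 GPU support, that the NVIDIA runtime injects nvidia-smi into the container):

docker run --rm --gpus all ubuntu nvidia-smi --query-gpu=name --format=csv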

Thank you for your help.