Llama-3_3-70b-instruct cannot select tensorrt-llm on L40s

Many attempts to configure this on an EC2 g6e.12xlarge with 4x L40S GPUs all fail with the same error:

```
/tmp/nim--meta--llama-3_3-70b-instruct-cftcu7d8. Expected model format to be one of ['hf-safetensor', 'trtllm-engine', 'trtllm-ckpt', 'gguf'].
```

# NVIDIA NIM Bug Report: TensorRT-LLM Profile Incompatible with NGC Model Format

## Summary

The NIM container fails to recognize NGC-downloaded TensorRT-LLM engines when an explicit TensorRT-LLM profile is set, despite successful model download and profile validation.

## Environment

- **NIM Version**: 1.12.0

- **Container**: `nvcr.io/nim/meta/llama-3.3-70b-instruct:latest`

- **Hardware**: AWS EC2 g6e.12xlarge (4x NVIDIA L40S GPUs)

- **Model**: Llama 3.3 70B Instruct

- **Profile ID**: `668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c` (tensorrt_llm)

## Bug Description

When using a valid NGC model URL with an explicit TensorRT-LLM profile, NIM successfully downloads the model but fails during format validation, claiming the downloaded format doesn’t match expected TensorRT-LLM structure.

## Reproduction Steps

1. **Working NGC Download** (validates model name format):

```bash
docker run -d --name nemo \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_MODEL_NAME='ngc://nim/meta/llama-3.3-70b-instruct' \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u 1000 \
  -p 80:8000 \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
```

2. **Failing with TensorRT-LLM Profile**:

```bash
docker run -d --name nemo \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE='668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c' \
  -e NIM_MODEL_NAME='ngc://nim/meta/llama-3.3-70b-instruct' \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u 1000 \
  -p 80:8000 \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
```
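
For reference, the profile ID used above can be cross-checked against what the container itself reports as compatible with the local GPUs. NIM containers ship a `list-model-profiles` utility; the exact output format may vary by NIM version:

```bash
# Enumerate the profiles this NIM container considers compatible
# with the GPUs visible to Docker (here: the 4x L40S).
docker run --rm --gpus all \
  -e NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest \
  list-model-profiles
```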

## Expected Behavior

The TensorRT-LLM profile should be compatible with NGC-downloaded TensorRT engines for the same model, since both are NVIDIA-provided components.

## Actual Behavior

### 1. Successful Profile Validation

```
INFO 2025-09-17 19:02:49.996 ngc_injector.py:149] Valid profile: 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm) on GPUs [0, 1, 2, 3]
INFO 2025-09-17 19:02:49.996 ngc_injector.py:302] Selected profile: 668b575f1701fa70a97cfeeae998b5d70b048a9b917682291bb82b67f308f80c (tensorrt_llm)
INFO 2025-09-17 19:02:49.996 ngc_injector.py:321] Profile metadata: llm_engine: tensorrt_llm
```

### 2. Successful Model Download

```
INFO 2025-09-17 19:02:49.369 ngc_injector.py:196] Model workspace is now ready. It took 0.466 seconds
INFO 2025-09-17 19:02:49.573 utils.py:125] Found following files in /tmp/nim--meta--llama-3_3-70b-instruct-cftcu7d8
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── checksums.blake3
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── config.json
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── metadata.json
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── rank0.engine
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── rank1.engine
INFO 2025-09-17 19:02:49.573 utils.py:129] ├── rank2.engine
INFO 2025-09-17 19:02:49.573 utils.py:129] └── rank3.engine
```

### 3. Format Validation Failure

```
ValueError: Invalid repository ID or local directory specified: /tmp/nim--meta--llama-3_3-70b-instruct-cftcu7d8. Expected model format to be one of ['hf-safetensor', 'trtllm-engine', 'trtllm-ckpt', 'gguf']. Please check NIM documentation for supported model formats and folder structures.
```

## Root Cause Analysis

The NGC model downloads TensorRT-LLM engines (`rank0.engine`, `rank1.engine`, etc.) but NIM’s format detection logic in `profile_utils.py` expects a specific directory structure:

**Expected for `trtllm-engine`:**

```
├── config.json
├── tokenizer files…
└── trtllm_engine/
    ├── config.json
    ├── rank0.engine
    └── ...
```

**Actual NGC download:**

```
├── checksums.blake3
├── config.json
├── metadata.json
├── rank0.engine
├── rank1.engine
├── rank2.engine
└── rank3.engine
```
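
The mismatch can be illustrated outside the container. The sketch below is a hypothetical approximation of the layout check, not the actual implementation in `profile_utils.py`; the directory names follow the two trees above:

```bash
# Hypothetical approximation of NIM's format detection, for illustration only.
# The flat NGC layout falls through to "no recognized format".
MODEL_DIR=/tmp/nim--meta--llama-3_3-70b-instruct-cftcu7d8   # example path

if ls "$MODEL_DIR"/trtllm_engine/rank*.engine >/dev/null 2>&1; then
  echo "detected: trtllm-engine"   # engines nested under trtllm_engine/
elif ls "$MODEL_DIR"/*.safetensors >/dev/null 2>&1; then
  echo "detected: hf-safetensor"
else
  echo "no recognized format"      # NGC's flat rank*.engine layout lands here
fi
```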

## Impact

- TensorRT-LLM profile cannot be used with NGC models

- Forces users to use vLLM (which may have memory constraints on smaller GPUs)

- Inconsistency between NVIDIA components (NGC registry vs NIM container expectations)

## Error Location

- **File**: `/opt/nim/llm/nim_llm_sdk/hub/profile_utils.py` (lines 781, 791, 521)

- **Call chain**: `ProfileFilter.__init__() → _update_allowed_backends() → evaluate_backend()`

## Suggested Fix

1. **Update format detection** to recognize the flat NGC TensorRT engine layout as a valid `trtllm-engine` format

2. **Add a format converter** that restructures NGC downloads into the expected directory layout (a user-side sketch of this restructuring appears after this list)

3. **Document compatibility** between NGC model formats and NIM profiles
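
As a local experiment for item 2, the flat download could in principle be restructured into the expected layout by hand. This is an untested sketch: it assumes the detection only looks at directory shape, and it will likely invalidate the `checksums.blake3` verification:

```bash
# Untested restructuring sketch: nest the engines under trtllm_engine/
# to match the layout NIM appears to expect. May break NIM's checksum
# validation, so treat this as a diagnostic experiment only.
MODEL_DIR=/tmp/nim--meta--llama-3_3-70b-instruct-cftcu7d8   # example path

mkdir -p "$MODEL_DIR/trtllm_engine"
mv "$MODEL_DIR"/rank*.engine "$MODEL_DIR/trtllm_engine/"
cp "$MODEL_DIR/config.json" "$MODEL_DIR/trtllm_engine/config.json"
```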

## Attempted Workarounds (All Failed)

Using the vLLM backend instead of TensorRT-LLM:

```bash
# Remove NIM_MODEL_PROFILE to auto-select vLLM
docker run -d --name nemo \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_NAME='ngc://nim/meta/llama-3.3-70b-instruct' \
  -e NIM_VLLM_EXTRA_ARGS="--gpu-memory-utilization 0.98 --max-model-len 16384" \
  nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
```

This also fails with a similar error.

## Additional Context

- Model download and profile validation both succeed independently

- The issue appears to be purely in the format detection/validation logic

- This affects any NGC model used with explicit TensorRT-LLM profiles

- Auto-selection works but may choose suboptimal backend for hardware

**Bug Report Venues:**

- **NGC Support**: NVIDIA NGC

**Severity**: High (no workaround available; blocks TensorRT-LLM usage on this hardware)
