(VSS 2.3.0) Issue with Using vila and nvila Models in VSS Deployment

  • Hardware Platform (GPU model and numbers)
    NVIDIA B200 × 8 (1 node)

  • System Memory
    2.0 TiB

  • Ubuntu Version
    24.04

  • NVIDIA GPU Driver Version (valid for GPU only)
    570.124.06

  • Issue Type (questions, new requirements, bugs)
    questions

  • How to reproduce the issue? (This is for bugs. Include the command line used and other details for reproducing.)


Hello nvidia team,

We deployed vss-blueprint (nvidia-blueprint-vss-2.3.0) with the Helm values below, but an error occurs when the vila model is loaded.

Here are our Helm value settings. For the rerank and embedding models, specifying the image tag in the values did not take effect, so we updated the images manually using:

microk8s kubectl set image deployment/nemo-embedding-embedding-deployment \
  embedding-container=nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.5.0

microk8s kubectl set image deployment/nemo-rerank-ranking-deployment \
  ranking-container=nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.5.0
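
To confirm the manual image updates actually took effect, a quick check along these lines can be run (a sketch; the deployment names are the same ones used in the commands above):

microk8s kubectl rollout status deployment/nemo-embedding-embedding-deployment
microk8s kubectl get deployment nemo-embedding-embedding-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

microk8s kubectl rollout status deployment/nemo-rerank-ranking-deployment
microk8s kubectl get deployment nemo-rerank-ranking-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

Both get commands should print the :1.5.0 images once the rollouts have completed.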

(Helm values configuration)

global:
  ngcImagePullSecretName: ngc-docker-reg-secret
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4"
  image:
    repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
    tag: 1.5.0
  resources:
    limits:
      nvidia.com/gpu: 0
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4"
  image:
    repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
    tag: 1.5.0
  resources:
    limits:
      nvidia.com/gpu: 0
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: 0,1,2,3
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key-secret
  - name: OPENAI_API_KEY_NAME
    value: OPENAI_API_KEY
  image:
    repository: nvcr.io/nim/meta/llama-3.1-70b-instruct
    tag: 1.10.1
  resources:
    limits:
      nvidia.com/gpu: 0
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5
          - name: OPENAI_API_KEY
            valueFrom:
              secretKeyRef:
                key: OPENAI_API_KEY
                name: openai-api-key-secret
          - name: OPENAI_API_KEY_NAME
            value: OPENAI_API_KEY
          - name: MODEL_PATH
            value: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
          - name: NVIDIA_VISIBLE_DEVICES
            value: 4,5,6,7
          - name: ASSET_STORAGE_DIR
            value: /tmp/custom-asset-dir
          - name: EXAMPLE_STREAMS_DIR
            value: /tmp/custom-example-streams-dir
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          startupProbe:
            failureThreshold: 360
  configs:
    ca_rag_config.yaml:
      chat:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
        reranker:
          base_url: http://nemo-rerank-ranking-deployment-ranking-service:8000/v1
      notification:
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
      summarization:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-70b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  extraPodVolumeMounts:
  - mountPath: /tmp/custom-asset-dir
    name: custom-asset-dir
  - mountPath: /tmp/custom-example-streams-dir
    name: custom-example-streams-dir
  extraPodVolumes:
  - hostPath:
      path: /home/nvadmin/Workspace/blueprint/video_uploads
    name: custom-asset-dir
  - hostPath:
      path: /home/nvadmin/Workspace/blueprint/video_examples
    name: custom-example-streams-dir
  resources:
    limits:
      nvidia.com/gpu: 0
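
For completeness, these values were applied at install time with a command roughly like the following (a sketch only; the release name, chart file path, and namespace here are placeholders rather than the exact command we ran):

microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz \
  --namespace default \
  -f values.yaml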

When using vila-1.5, the following error occurs:

[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Loading checkpoint shards: 100%|██████████| 15/15 [00:18<00:00,  1.23s/it]
Downloading readme: 15.6kB [00:00, 51.0MB/s]
Downloading data: 100%|██████████| 257M/257M [00:05<00:00, 45.3MB/s]
Downloading data: 100%|██████████| 257M/257M [00:05<00:00, 45.3MB/s]
Downloading data: 100%|██████████| 259M/259M [00:05<00:00, 47.7MB/s]
Downloading data: 100%|██████████| 34.7M/34.7M [00:01<00:00, 32.7MB/s]
Downloading data: 100%|██████████| 30.0M/30.0M [00:00<00:00, 30.9MB/s]
Generating train split: 100%|██████████| 287113/287113 [00:02<00:00, 104695.61 examples/s]
Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 98962.63 examples/s]
Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 105142.65 examples/s]
Inserted 1263 quantizers
Traceback (most recent call last):
  File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 167, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 747, in quantize_and_export
    model = quantize_model(model, quant_cfg, calib_dataloader, batch_size,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 542, in quantize_model
    mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py", line 234, in quantize
    return calibrate(model, config["algorithm"], forward_loop=forward_loop)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py", line 106, in calibrate
    awq(model, algorithm, forward_loop, **kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 468, in awq
    awq_lite(model, forward_loop, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 635, in awq_lite
    module.awq_lite = AWQLiteHelper(module)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 529, in __init__
    self.weight_scale = get_weight_scale(module.weight, self.block_size)
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 544, in get_weight_scale
    weight_abs_amax = weight.abs().amax(dim=1, keepdim=True)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR: Failed to convert checkpoint
2025-07-02 01:07:50,751 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 1368, in run
    self._stream_handler = ViaStreamHandler(self._args)
  File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 416, in __init__
    self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
  File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1270, in __init__
    raise Exception("Failed to generate TRT-LLM engine")
Exception: Failed to generate TRT-LLM engine

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 2880, in <module>
    server.run()
  File "/tmp/via/via-engine/via_server.py", line 1370, in run
    raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Killed process with PID 149
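
The "no kernel image is available for execution on the device" error typically means that the CUDA kernels in the PyTorch/ModelOpt wheels inside the image were not compiled for this GPU's compute capability (the B200 reports SM 10.0, which also appears in the nvila log further below). A quick hedged check of what the container actually sees, where <vss-pod-name> is a placeholder for the running VSS pod:

# Compute capability reported by the driver (B200 -> 10.0)
nvidia-smi --query-gpu=name,compute_cap --format=csv

# CUDA architectures the bundled PyTorch wheel was built for
microk8s kubectl exec -it <vss-pod-name> -- \
  python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_arch_list())"

If sm_100 is missing from the printed arch list, these kernels cannot run on the B200.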

We tried nvila as well, but it also fails to load:

env:
- name: VLM_MODEL_TO_USE
  value: nvila
- name: MODEL_PATH
  value: git:https://huggingface.co/Efficient-Large-Model/NVILA-15B

nvila error:

[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
[07/02/2025-04:44:27] [TRT-LLM] [E] Failed to load tokenizer from /tmp/via-ngc-model-cache/NVILA-15B
VILA TRT model load execution time = 11.644 sec
Process VlmProcess-5:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/nvidia/via/via-engine/vlm_pipeline/process_base.py", line 235, in run
    if not self._initialize():
  File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 665, in _initialize
    self._model = NVila(
  File "/opt/nvidia/via/via-engine/models/nvila/nvila_model.py", line 50, in __init__
    self._llm = LLM(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/llm.py", line 28, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 137, in __init__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 132, in __init__
    self._build_model()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 396, in _build_model
    self.input_processor = create_input_processor(self.args.model,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/inputs/registry.py", line 91, in create_input_processor
    from tensorrt_llm._torch.models import get_model_architecture
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/__init__.py", line 3, in <module>
    from .modeling_auto import AutoModelForCausalLM
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 4, in <module>
    from .modeling_utils import (MODEL_CLASS_MAPPING, DecoderModelForCausalLM,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 14, in <module>
    from ..attention_backend import AttentionMetadata
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/attention_backend/__init__.py", line 16, in <module>
    from .flashinfer import FlashInferAttention, FlashInferAttentionMetadata
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/attention_backend/flashinfer.py", line 16, in <module>
    check_cuda_arch()
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/jit/core.py", line 49, in check_cuda_arch
    for cuda_arch_flags in torch_cpp_ext._get_cuda_arch_flags():
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1997, in _get_cuda_arch_flags
    raise ValueError(f"Unknown CUDA arch ({arch}) or GPU not supported")
ValueError: Unknown CUDA arch (10.0) or GPU not supported
(The identical tokenizer-load error and traceback repeat for the remaining worker processes, VlmProcess-1 through VlmProcess-4 and VlmProcess-6; only the process index and timestamps differ.)
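
Alongside the CUDA-arch failure, every worker also logs "Failed to load tokenizer from /tmp/via-ngc-model-cache/NVILA-15B". A quick hedged check that the Hugging Face checkout was actually downloaded completely (<vss-pod-name> is a placeholder for the running VSS pod, and the tokenizer files may sit in a subdirectory of the repo):

microk8s kubectl exec -it <vss-pod-name> -- \
  ls -R /tmp/via-ngc-model-cache/NVILA-15B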

The embedding and rerank models are both working properly.
For now, we are using OpenAI models instead of vila or nvila, but we would prefer to use vila or nvila if possible.
Thank you

The platforms we currently support are listed here:
https://docs.nvidia.com/vss/latest/content/supported_platforms.html
The B200 may not be supported by the TensorRT-LLM build used in VSS.

I see that B200 GPUs are supported starting from TensorRT-LLM 0.17.0. However, even though the TensorRT-LLM version in the container is newer than that, it still does not work. In our server environment, PyTorch 2.7.1 (stable) built against CUDA 12.8 appears to be compatible with the B200, yet VSS fails with a torch-related error. Is it possible to manually upgrade the CUDA version or other dependencies inside the VSS image to resolve this? Or are there plans to expand B200 support in VSS in the future?
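
To clarify what we mean by manually upgrading dependencies, this is purely a sketch of the kind of experiment we have in mind (<vss-pod-name> is a placeholder, and we understand that changes made this way do not persist across pod restarts and may conflict with other packages pinned in the image):

# Try installing a CUDA 12.8 build of PyTorch inside the running VSS container
microk8s kubectl exec -it <vss-pod-name> -- \
  pip install --upgrade "torch==2.7.1" --index-url https://download.pytorch.org/whl/cu128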

Because VSS involves many modules, enabling B200 support requires an adaptation process.

Could you try our latest version, VSS 2.3.1?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Version 2.3.1 supports the B200, so we’ve applied it. Thank you!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.