- Hardware Platform (GPU model and numbers): NVIDIA B200 * 8 (1 node)
- System Memory: 2.0Ti
- Ubuntu Version: 24.04
- NVIDIA GPU Driver Version (valid for GPU only): 570.124.06
- Issue Type (questions, new requirements, bugs): questions
- How to reproduce the issue? (This is for bugs. Including the command line used and other details for reproducing)
Hello NVIDIA team,
We deployed the VSS blueprint (nvidia-blueprint-vss-2.3.0) with the configuration below, but an error occurs when loading the VILA model.
Our Helm values are shown further down. For the rerank and embedding models, specifying the image tag in the values did not take effect, so we updated the images manually with:
microk8s kubectl set image deployment/nemo-embedding-embedding-deployment \
embedding-container=nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.5.0
microk8s kubectl set image deployment/nemo-rerank-ranking-deployment \
ranking-container=nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.5.0
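The manual update can be verified by reading the images back from the deployment specs, for example:
microk8s kubectl get deployment nemo-embedding-embedding-deployment \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
microk8s kubectl get deployment nemo-rerank-ranking-deployment \
  -o jsonpath='{.spec.template.spec.containers[*].image}'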
(Helm values configuration)
global:
  ngcImagePullSecretName: ngc-docker-reg-secret
nemo-embedding:
  applicationSpecs:
    embedding-deployment:
      containers:
        embedding-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4"
          image:
            repository: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2
            tag: 1.5.0
          resources:
            limits:
              nvidia.com/gpu: 0
nemo-rerank:
  applicationSpecs:
    ranking-deployment:
      containers:
        ranking-container:
          env:
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          - name: NVIDIA_VISIBLE_DEVICES
            value: "4"
          image:
            repository: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2
            tag: 1.5.0
          resources:
            limits:
              nvidia.com/gpu: 0
nim-llm:
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: 0,1,2,3
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        key: OPENAI_API_KEY
        name: openai-api-key-secret
  - name: OPENAI_API_KEY_NAME
    value: OPENAI_API_KEY
  image:
    repository: nvcr.io/nim/meta/llama-3.1-70b-instruct
    tag: 1.10.1
  resources:
    limits:
      nvidia.com/gpu: 0
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
          - name: VLM_MODEL_TO_USE
            value: vila-1.5
          - name: OPENAI_API_KEY
            valueFrom:
              secretKeyRef:
                key: OPENAI_API_KEY
                name: openai-api-key-secret
          - name: OPENAI_API_KEY_NAME
            value: OPENAI_API_KEY
          - name: MODEL_PATH
            value: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8
          - name: NVIDIA_VISIBLE_DEVICES
            value: 4,5,6,7
          - name: ASSET_STORAGE_DIR
            value: /tmp/custom-asset-dir
          - name: EXAMPLE_STREAMS_DIR
            value: /tmp/custom-example-streams-dir
          - name: NGC_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-key-secret
          startupProbe:
            failureThreshold: 360
  configs:
    ca_rag_config.yaml:
      chat:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
        reranker:
          base_url: http://nemo-rerank-ranking-deployment-ranking-service:8000/v1
      notification:
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
      summarization:
        embedding:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        llm:
          base_url: http://llm-nim-svc:8000/v1
          model: meta/llama-3.1-70b-instruct
    guardrails_config.yaml:
      models:
      - engine: nim
        model: meta/llama-3.1-70b-instruct
        parameters:
          base_url: http://llm-nim-svc:8000/v1
        type: main
      - engine: nim
        model: nvidia/llama-3.2-nv-embedqa-1b-v2
        parameters:
          base_url: http://nemo-embedding-embedding-deployment-embedding-service:8000/v1
        type: embeddings
  extraPodVolumeMounts:
  - mountPath: /tmp/custom-asset-dir
    name: custom-asset-dir
  - mountPath: /tmp/custom-example-streams-dir
    name: custom-example-streams-dir
  extraPodVolumes:
  - hostPath:
      path: /home/nvadmin/Workspace/blueprint/video_uploads
    name: custom-asset-dir
  - hostPath:
      path: /home/nvadmin/Workspace/blueprint/video_examples
    name: custom-example-streams-dir
  resources:
    limits:
      nvidia.com/gpu: 0
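For reference, the chart itself was installed in the standard way; the command was roughly the following (the release name and overrides file name are placeholders here, not our exact invocation):
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.0.tgz -f overrides.yaml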
When using vila-1.5, the following error occurs:
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Loading checkpoint shards: 100%|██████████| 15/15 [00:18<00:00, 1.23s/it]
Downloading readme: 15.6kB [00:00, 51.0MB/s]
Downloading data: 100%|██████████| 257M/257M [00:05<00:00, 45.3MB/s]
Downloading data: 100%|██████████| 257M/257M [00:05<00:00, 45.3MB/s]
Downloading data: 100%|██████████| 259M/259M [00:05<00:00, 47.7MB/s]
Downloading data: 100%|██████████| 34.7M/34.7M [00:01<00:00, 32.7MB/s]
Downloading data: 100%|██████████| 30.0M/30.0M [00:00<00:00, 30.9MB/s]
Generating train split: 100%|██████████| 287113/287113 [00:02<00:00, 104695.61 examples/s]
Generating validation split: 100%|██████████| 13368/13368 [00:00<00:00, 98962.63 examples/s]
Generating test split: 100%|██████████| 11490/11490 [00:00<00:00, 105142.65 examples/s]
Inserted 1263 quantizers
Traceback (most recent call last):
File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 167, in <module>
quantize_and_export(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 747, in quantize_and_export
model = quantize_model(model, quant_cfg, calib_dataloader, batch_size,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 542, in quantize_model
mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py", line 234, in quantize
return calibrate(model, config["algorithm"], forward_loop=forward_loop)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_quant.py", line 106, in calibrate
awq(model, algorithm, forward_loop, **kwargs) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 468, in awq
awq_lite(model, forward_loop, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 635, in awq_lite
module.awq_lite = AWQLiteHelper(module)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 529, in __init__
self.weight_scale = get_weight_scale(module.weight, self.block_size)
File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/quantization/model_calib.py", line 544, in get_weight_scale
weight_abs_amax = weight.abs().amax(dim=1, keepdim=True)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR: Failed to convert checkpoint
2025-07-02 01:07:50,751 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Traceback (most recent call last):
File "/tmp/via/via-engine/via_server.py", line 1368, in run
self._stream_handler = ViaStreamHandler(self._args)
File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 416, in __init__
self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1270, in __init__
raise Exception("Failed to generate TRT-LLM engine")
Exception: Failed to generate TRT-LLM engine
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/via/via-engine/via_server.py", line 2880, in <module>
server.run()
File "/tmp/via/via-engine/via_server.py", line 1370, in run
raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Killed process with PID 149
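Since the error says no CUDA kernel image is available for the device, we suspect the PyTorch/ModelOpt stack inside the VSS container has no kernels built for the B200's compute capability (10.0, which also shows up in the NVILA log below). The capability and CUDA runtime seen inside the container can be checked with something like this (the pod name is a placeholder):
microk8s kubectl exec -it <vss-deployment-pod> -- python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_capability(0))"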
We also tried using nvila, but it fails with a similar error:
env:
- name: VLM_MODEL_TO_USE
  value: nvila
- name: MODEL_PATH
  value: git:https://huggingface.co/Efficient-Large-Model/NVILA-15B
NVILA error:
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
[07/02/2025-04:44:27] [TRT-LLM] [E] Failed to load tokenizer from /tmp/via-ngc-model-cache/NVILA-15B
VILA TRT model load execution time = 11.644 sec
Process VlmProcess-5:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/nvidia/via/via-engine/vlm_pipeline/process_base.py", line 235, in run
if not self._initialize():
File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 665, in _initialize
self._model = NVila(
File "/opt/nvidia/via/via-engine/models/nvila/nvila_model.py", line 50, in __init__
self._llm = LLM(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/llm.py", line 28, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 137, in __init__
raise e
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 132, in __init__
self._build_model()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 396, in _build_model
self.input_processor = create_input_processor(self.args.model,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/inputs/registry.py", line 91, in create_input_processor
from tensorrt_llm._torch.models import get_model_architecture
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/__init__.py", line 3, in <module>
from .modeling_auto import AutoModelForCausalLM
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_auto.py", line 4, in <module>
from .modeling_utils import (MODEL_CLASS_MAPPING, DecoderModelForCausalLM,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 14, in <module>
from ..attention_backend import AttentionMetadata
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/attention_backend/__init__.py", line 16, in <module>
from .flashinfer import FlashInferAttention, FlashInferAttentionMetadata
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_torch/attention_backend/flashinfer.py", line 16, in <module>
check_cuda_arch()
File "/usr/local/lib/python3.10/dist-packages/flashinfer/jit/core.py", line 49, in check_cuda_arch
for cuda_arch_flags in torch_cpp_ext._get_cuda_arch_flags():
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1997, in _get_cuda_arch_flags
raise ValueError(f"Unknown CUDA arch ({arch}) or GPU not supported")
ValueError: Unknown CUDA arch (10.0) or GPU not supported
(The same "Unknown CUDA arch (10.0) or GPU not supported" traceback, preceded by the same "Failed to load tokenizer from /tmp/via-ngc-model-cache/NVILA-15B" message, is repeated for VlmProcess-1 through VlmProcess-4 and VlmProcess-6.)
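The "Unknown CUDA arch (10.0)" error comes from PyTorch's cpp_extension arch handling (see the traceback above), so the torch build bundled with TensorRT-LLM 0.18.0.dev2025020400 apparently does not know about Blackwell (sm_100). The architectures that torch build was compiled for can be listed with something like this (pod name again a placeholder):
microk8s kubectl exec -it <vss-deployment-pod> -- python3 -c "import torch; print(torch.cuda.get_arch_list())"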
The embedding and rerank models are both working properly.
For now we are using OpenAI instead of the VILA or NVILA models, but we would prefer to use VILA or NVILA if possible.
Thank you.