VSS Jetson Thor NVILA deployment libcublas.so.12 import error

Please provide the following information when creating a topic:

  • Hardware Platform (GPU model and numbers): Jetson Thor
  • Ubuntu Version: 24.04.3
  • NVIDIA GPU Driver Version (valid for GPU only): 580.00
  • Jetpack version: 7.0
  • CUDA version: 13.0

Hello, I’m attempting to deploy the NVILA model using the docker remote_llm_deployment setup on a Jetson Thor, but I’m getting the following errors:

```
via-server-1 | GPU has 2 decode engines
via-server-1 | Free GPU memory is [N/A] MiB
via-server-1 | /opt/nvidia/via/start_via.sh: line 77: [: [N/A]: integer expression expected
via-server-1 | Total GPU memory is 125772 MiB per GPU
via-server-1 | Auto-selecting VLM Batch Size to 128
via-server-1 | Using nvila
via-server-1 | Starting VIA server in release mode
via-server-1 | 2026-01-20 14:53:53,240 INFO Initializing VIA Stream Handler
via-server-1 | 2026-01-20 14:53:53,243 INFO {'gdino_engine': '/root/.via/ngc_model_cache//cv_pipeline_models/swin.fp16.engine', 'tracker_config': '/tmp/via_tracker_config.yml', 'inference_interval': 1}
via-server-1 | 2026-01-20 14:53:53,243 INFO Initializing VLM pipeline
via-server-1 | 2026-01-20 14:53:53,252 INFO Have peer access: True
via-server-1 | 2026-01-20 14:53:53,253 INFO Using model cached at /root/.via/ngc_model_cache/nvidia_tao_nvila-highres_nvila-lite-15b-highres-lita
via-server-1 | 2026-01-20 14:53:53,254 INFO GPUs per VLM instance: 1
via-server-1 | 2026-01-20 14:53:53,254 INFO num_vlm_procs set to 1
via-server-1 | INFO: Started server process [160]
via-server-1 | INFO: Waiting for application startup.
via-server-1 | INFO: Application startup complete.
via-server-1 | INFO: Uvicorn running on http://127.0.0.1:60000 (Press CTRL+C to quit)
via-server-1 | 2026-01-20 14:53:55,695 INFO Initializing VlmProcess-0
via-server-1 | 2026-01-20 14:53:55,695 INFO Initializing DecoderProcess-0
via-server-1 | Process VlmProcess-1:
via-server-1 | Traceback (most recent call last):
via-server-1 | File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
via-server-1 | self.run()
via-server-1 | File "/opt/nvidia/via/via-engine/vlm_pipeline/process_base.py", line 240, in run
via-server-1 | if not self._initialize():
via-server-1 | ^^^^^^^^^^^^^^^^^^
via-server-1 | File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 730, in _initialize
via-server-1 | self._model = NVila(
via-server-1 | ^^^^^^
via-server-1 | File "/opt/nvidia/via/via-engine/models/nvila/nvila_model.py", line 44, in __init__
via-server-1 | import tensorrt_llm
via-server-1 | File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
via-server-1 | import tensorrt_llm.functional as functional
via-server-1 | File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 28, in <module>
via-server-1 | from . import graph_rewriting as gw
via-server-1 | File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/graph_rewriting.py", line 11, in <module>
via-server-1 | from ._utils import trt_gte
via-server-1 | File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_utils.py", line 42, in <module>
via-server-1 | from tensorrt_llm.bindings import DataType, GptJsonConfig
via-server-1 | ImportError: libcublas.so.12: cannot open shared object file: No such file or directory
via-server-1 | 2026-01-20 14:53:58,787 INFO Warmup DecoderProcess-0
via-server-1 | /bin/dash: 1: lsmod: not found
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:53:59,150 INFO Video stream found.
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:53:59,498 INFO Video stream found.
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:53:59,748 INFO Video stream found.
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:53:59,982 INFO Video stream found.
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:54:00,220 INFO Video stream found.
via-server-1 | Opening in BLOCKING MODE
via-server-1 | 2026-01-20 14:54:00,462 INFO Video stream found.
via-server-1 | 2026-01-20 14:54:00,557 INFO Warmup DecoderProcess-0 done
via-server-1 | 2026-01-20 14:54:00,559 INFO Initialized DecoderProcess-0
via-server-1 | 2026-01-20 14:54:01,561 INFO Stopping VLM pipeline
via-server-1 | [W120 14:54:02.704658614 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
via-server-1 | 2026-01-20 14:54:02,778 INFO Stopped VLM pipeline
via-server-1 | 2026-01-20 14:54:02,779 ERROR Failed to load VIA stream handler - Failed to load VLM on GPU 0
via-server-1 | Traceback (most recent call last):
via-server-1 | File "/opt/nvidia/via/via-engine/via_server.py", line 254, in run
via-server-1 | self._stream_handler = ViaStreamHandler(self._args)
via-server-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
via-server-1 | File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 592, in __init__
via-server-1 | self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
via-server-1 | File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 1560, in __init__
via-server-1 | raise Exception(f"Failed to load VLM on GPU {idx}")
via-server-1 | Exception: Failed to load VLM on GPU 0
via-server-1 |
via-server-1 | During handling of the above exception, another exception occurred:
via-server-1 |
via-server-1 | Traceback (most recent call last):
via-server-1 | File "/opt/nvidia/via/via-engine/via_server.py", line 2371, in <module>
via-server-1 | server.run()
via-server-1 | File "/opt/nvidia/via/via-engine/via_server.py", line 256, in run
via-server-1 | raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via-server-1 | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to load VLM on GPU 0
via-server-1 | Killed process with PID 157
via-server-1 exited with code 1
```
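As a quick sanity check inside the container, you can also ask the dynamic linker whether it can locate libcublas at all. This is a minimal sketch using only the Python standard library; probing by the short name `cublas` is an assumption about how the library is registered with `ldconfig`:

```python
import ctypes.util

def can_find_library(name: str) -> bool:
    """Return True if the dynamic linker can locate a shared library by short name."""
    return ctypes.util.find_library(name) is not None

# On a working install this prints a truthy result; in the failing container,
# 'cublas' would not be found, matching the ImportError in the log.
print("libcublas visible:", can_find_library("cublas"))
```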

What I’ve tried:

  • Both via-server images (non-sbsa and sbsa)
  • Setting NVILA_USE_PYTORCH=1

Is there anything else I could try?

Please refer to this patch.

1. Since TensorRT-LLM does not support Jetson Thor, you can run VLM inference using PyTorch instead:

```diff
diff --git a/deploy/docker/remote_llm_deployment/.env b/deploy/docker/remote_llm_deployment/.env
index 97e849e..e0e17d2 100644
--- a/deploy/docker/remote_llm_deployment/.env
+++ b/deploy/docker/remote_llm_deployment/.env
@@ -1,5 +1,5 @@
-export NGC_API_KEY=abc123*** #FIXME - api key to pull model from NGC. Should come from ngc.nvidia.com
-export NVIDIA_API_KEY=nvapi-*** #api key to access NIM endpoints. Should come from build.nvidia.com
+export NGC_API_KEY=nvapi #FIXME - api key to pull model from NGC. Should come from ngc.nvidia.com
+export NVIDIA_API_KEY=nvapi #api key to access NIM endpoints. Should come from build.nvidia.com
 
 #Adjust ports if needed
 export FRONTEND_PORT=9100
@@ -27,10 +27,17 @@ export DISABLE_CV_PIPELINE=true
 export INSTALL_PROPRIETARY_CODECS=false # Set to true when enabling CV
 
 #Set VLM to Cosmos-Reason1
-export VLM_MODEL_TO_USE=cosmos-reason1
-export MODEL_PATH=ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic
+# export VLM_MODEL_TO_USE=cosmos-reason1
+# export MODEL_PATH=ngc:nim/nvidia/cosmos-reason1-7b:1.1-fp8-dynamic
+
+export VLM_MODEL_TO_USE=nvila
+export MODEL_PATH=git:https://huggingface.co/Efficient-Large-Model/NVILA-15B
+export NVILA_VIDEO_MAX_TILES=1
+export NVILA_USE_PYTORCH=true
 
 #Adjust misc configs if needed
 export DISABLE_GUARDRAILS=false
 export NVIDIA_VISIBLE_DEVICES=0 #For H100 Deployment
 
 # cache model to host, avoid download again and again
+export NGC_MODEL_CACHE=xxxx/.cache/.vss/ngc_model_cache
+export TRT_ENGINE_PATH=xxxx/.cache/.vss/trt_engine_cache
 #export NVIDIA_VISIBLE_DEVICES=0,1,2 #For L40S Deployment
diff --git a/deploy/docker/remote_llm_deployment/compose.yaml b/deploy/docker/remote_llm_deployment/compose.yaml
index 1a13613..ab0ef34 100644
--- a/deploy/docker/remote_llm_deployment/compose.yaml
+++ b/deploy/docker/remote_llm_deployment/compose.yaml
@@ -39,8 +39,10 @@ services:
       - via-hf-cache:/tmp/huggingface
       - "${CV_PIPELINE_TRACKER_CONFIG:-/dummy}${CV_PIPELINE_TRACKER_CONFIG:+:/opt/nvidia/via/config/default_tracker_config.yml}"
       - "${ALERT_REVIEW_MEDIA_BASE_DIR:-/dummy}${ALERT_REVIEW_MEDIA_BASE_DIR:+:${ALERT_REVIEW_MEDIA_BASE_DIR:-}}"
+      - xxxxx/xvideo-search-and-summarization/src/vss-engine/src:/opt/nvidia/via/via-engine
 
     environment:
+      NVILA_USE_PYTORCH: "${NVILA_USE_PYTORCH:-true}"
       AZURE_OPENAI_API_KEY: "${AZURE_OPENAI_API_KEY:-}"
       AZURE_OPENAI_ENDPOINT: "${AZURE_OPENAI_ENDPOINT:-}"
       BACKEND_PORT: "${BACKEND_PORT?}"
```

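If you want to confirm that the flag actually reaches the engine process inside the container, shell-style boolean env vars are typically parsed along these lines. Note this is a sketch of the usual convention; the exact parsing inside the VSS engine is an assumption on my part:

```python
import os

def flag_enabled(name: str, env=None) -> bool:
    """Interpret a shell-style boolean env var: '1', 'true', or 'yes' count as enabled."""
    env = os.environ if env is None else env
    return env.get(name, "false").strip().lower() in ("1", "true", "yes")

# Inside the via-server container this should report True after the compose change.
print("NVILA_USE_PYTORCH:", flag_enabled("NVILA_USE_PYTORCH"))
```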
2. You may encounter insufficient memory issues on a Jetson Thor. Please use this script to clear the system cache first:

```
cd video-search-and-summarization/deploy/scripts
sudo ./sys_cache_cleaner.sh
```

3. I strongly recommend you use Cosmos-Reason1 instead. NVILA is an older VLM and may not be supported in the future.


It works. Thanks!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.