Can't start Live LLaVA on Jetson Orin Nano Developer Kit

vincent.nguyen · April 25, 2024, 8:00am

Hi,

I am following the tutorial at https://www.jetson-ai-lab.com/tutorial_live-llava.html. When I run this command:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
    --model Efficient-Large-Model/VILA-2.7b \
    --max-context-len 768 \
    --max-new-tokens 32 \
    --video-input /dev/video0 \
    --video-output webrtc://@:8554/output

I get this error:

dustynv/nano_llm:r36.2.0
localuser:root being added to access control list
xauth:  file /tmp/.docker.xauth does not exist
+ sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/jetsonano/Documents/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/video0 --device /dev/video1 dustynv/nano_llm:r36.2.0 python3 -m nano_llm.agents.video_query --api=mlc --model Efficient-Large-Model/VILA-2.7b --max-new-tokens 32 --video-input /dev/video0 --video-output webrtc://@:8554/output
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 64527.75it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 49587.83it/s]
06:36:25 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA-2.7b/snapshots/2ed82105eefd5926cccb46af9e71b0ca77f12704 with MLC
06:36:27 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors 


Using path "/data/models/mlc/dist/models/VILA-2.7b" for model "VILA-2.7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                                                                                                                                                           | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights... This may take a while.                                                                                                                | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|█▍                                                                                                                                                 | 2/197 [00:03<04:08,  1.27s/tensors]Process Process-1:%|▍                                                                                                                                                  | 1/327 [00:03<16:31,  3.04s/tensors]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 358, in <module>
    agent = VideoQuery(**vars(args)).run() 
  File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 44, in __init__
    self.llm = ProcessProxy('ChatQuery', model=model, drop_inputs=True, vision_scaling=vision_scaling, **kwargs) #ProcessProxy((lambda **kwargs: ChatQuery(model, drop_inputs=True, **kwargs)), **kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 38, in __init__
    raise RuntimeError(f"subprocess has an invalid initialization status ({init_msg['status']})")
RuntimeError: subprocess has an invalid initialization status (<class 'subprocess.CalledProcessError'>)
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 132, in run_process
    raise error
  File "/opt/NanoLLM/nano_llm/plugins/process_proxy.py", line 126, in run_process
    self.plugin = ChatQuery(**kwargs)
  File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 70, in __init__
    self.model = NanoLLM.from_pretrained(model, **kwargs)
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors ' died with <Signals.SIGKILL: 9>.

Below is my nvidia-jetpack info:

sudo apt-cache show nvidia-jetpack
[sudo] password for jetsonano: 
Package: nvidia-jetpack
Version: 6.0-b52
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 6.0-b52), nvidia-jetpack-dev (= 6.0-b52)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_6.0-b52_arm64.deb
Size: 29294
SHA256: 01f3cfaed6f45ebabacbe5f2d4c3b74a296200ae928d68b97956470d54c4be98
SHA1: 950626b2b51381650e8ecb7e3b21f5e2e89cddb6
MD5sum: 1e58b6faa4b7a9695a1f5b0cb6035d85
Description: NVIDIA Jetpack Meta Package

Can someone please help me?

dusty_nv · April 25, 2024, 1:51pm

@vincent.nguyen please try running the chat application first to make sure the model is working for you:

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.chat --api=mlc \
    --model Efficient-Large-Model/VILA-2.7b \
    --max-context-len 768 \
    --max-new-tokens 128 \
    --prompt /data/prompts/images.json

vincent.nguyen · April 26, 2024, 10:52am

Hi @dusty_nv, I tried your suggestion:

jetson-containers run   --env HUGGINGFACE_TOKEN=hf_xxxxxx   $(autotag nano_llm)   python3 -m nano_llm.chat --api mlc     --model Efficient-Large-Model/VILA-2.7b     --prompt "Can you tell me a joke about llamas?"

It still shows the same error:

 jetson-containers run   --env HUGGINGFACE_TOKEN=hf_xxxxxx   $(autotag nano_llm)   python3 -m nano_llm.chat --api mlc     --model Efficient-Large-Model/VILA-2.7b     --prompt "Can you tell me a joke about llamas?"
Namespace(packages=['nano_llm'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.2.0  JETPACK_VERSION=6.0  CUDA_VERSION=12.2
-- Finding compatible container image for ['nano_llm']
dustynv/nano_llm:r36.2.0
localuser:root being added to access control list
+ sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/jetsonano/Documents/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/video0 --device /dev/video1 --env HUGGINGFACE_TOKEN=hf_xxxxxx dustynv/nano_llm:r36.2.0 python3 -m nano_llm.chat --api mlc --model Efficient-Large-Model/VILA-2.7b --prompt 'Can you tell me a joke about llamas?'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /data/models/huggingface/token
Login successful
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 60611.33it/s]
Fetching 12 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 60422.15it/s]
10:50:04 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA-2.7b/snapshots/2ed82105eefd5926cccb46af9e71b0ca77f12704 with MLC
10:50:06 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors 


Using path "/data/models/mlc/dist/models/VILA-2.7b" for model "VILA-2.7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                                                                                                                                                           | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights... This may take a while.                                                                                                                | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|█▍                                                                                                                                                 | 2/197 [00:03<04:08,  1.27s/tensors]Traceback (most recent call last):                                                                                                                                     | 1/327 [00:03<16:28,  3.03s/tensors]
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 29, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors ' died with <Signals.SIGKILL: 9>.

vincent.nguyen · April 30, 2024, 5:17am

Hi Dustin, I have reflashed the Orin Nano and reinstalled the SDK components.

The error still occurs, even when I run the command from inside the container:

 jetson-containers run $(autotag nano_llm)

root@ubuntu:/# python3 -m nano_llm.chat --model Efficient-Large-Model/VILA-2.7b --api=mlc --quantization q4f16_ft
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 40021.98it/s]
Fetching 12 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 69711.42it/s]
05:07:43 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA-2.7b/snapshots/2ed82105eefd5926cccb46af9e71b0ca77f12704 with MLC
05:07:44 | INFO | running MLC quantization:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors 


Using path "/data/models/mlc/dist/models/VILA-2.7b" for model "VILA-2.7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param:   0%|                                                                                                                                      | 0/197 [00:00<?, ?tensors/sStart computing and quantizing weights... This may take a while.                                                                                           | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|█▎                                                                                                                            | 2/197 [00:02<04:00,  1.23s/tensors]Traceback (most recent call last):                                                                                                                | 1/327 [00:02<16:00,  2.95s/tensors]
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NanoLLM/nano_llm/chat/__main__.py", line 29, in <module>
    model = NanoLLM.from_pretrained(
  File "/opt/NanoLLM/nano_llm/nano_llm.py", line 71, in from_pretrained
    model = MLCModel(model_path, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 59, in __init__
    quant = MLCModel.quantize(model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
  File "/opt/NanoLLM/nano_llm/models/mlc.py", line 278, in quantize
    subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA-2.7b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist/VILA-2.7b-ctx4096 --use-safetensors ' died with <Signals.SIGKILL: 9>.

dusty_nv · May 1, 2024, 10:08pm

Hi Vincent - from the "died with <Signals.SIGKILL: 9>" message, you can tell that the process is running out of memory. If you haven't already, can you try mounting swap, disabling ZRAM, and disabling the desktop UI if necessary?
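
As a rough illustration of why that message points at memory, here is a minimal Python sketch of how a child process killed by a signal shows up to its parent. The helper name run_and_explain is made up for this example and is not part of NanoLLM; the child below is a stand-in that kills itself rather than the real mlc_llm.build step. A SIGKILL does not prove the kernel OOM killer was responsible, but on a memory-constrained board it is the usual suspect.

```python
import signal
import subprocess
import sys


def run_and_explain(cmd):
    """Run cmd and return (returncode, hint), where hint is a short diagnosis."""
    proc = subprocess.run(cmd)
    if proc.returncode == -signal.SIGKILL:
        # subprocess reports "killed by signal N" as return code -N.
        return proc.returncode, "killed by SIGKILL; on Linux this is often the OOM killer"
    if proc.returncode < 0:
        return proc.returncode, f"terminated by signal {-proc.returncode}"
    return proc.returncode, "exited without being signalled"


if __name__ == "__main__":
    # Stand-in child that sends SIGKILL to itself (hypothetical, for illustration).
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_explain(child))
```

If the hint mentions SIGKILL, checking the kernel log on the host (for example with dmesg) is a reasonable next step to confirm whether the OOM killer was involved.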

https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#mounting-swap
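
To confirm that the swap changes took effect, one option is to read /proc/swaps on the Jetson. The sketch below is only an illustration: the function name summarize_swaps is hypothetical, and it assumes ZRAM devices appear as /dev/zram* entries and that sizes are listed in KiB, which is the usual Linux layout.

```python
from pathlib import Path


def summarize_swaps(text):
    """Parse /proc/swaps content and return (zram_kib, other_kib) totals."""
    zram_kib = 0
    other_kib = 0
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        name, size_kib = fields[0], int(fields[2])
        if name.startswith("/dev/zram"):
            zram_kib += size_kib
        else:
            other_kib += size_kib
    return zram_kib, other_kib


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    if swaps.exists():
        zram, other = summarize_swaps(swaps.read_text())
        # KiB divided by 2**20 gives GiB.
        print(f"zram swap: {zram / 2**20:.1f} GiB, other swap: {other / 2**20:.1f} GiB")
```

After following the setup guide, the expectation would be a zero ZRAM total and a non-zero total for the swap file.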

Also, it looks like you left out the --max-context-len 768 flag. That should also reduce the memory usage.
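
For a sense of scale: the logs above show the build defaulting to --max-seq-len 4096 when the flag is omitted. One thing that grows linearly with context length is the KV cache. The arithmetic below is a back-of-the-envelope sketch only. The layer count and hidden size are hypothetical placeholders, not the real VILA-2.7b configuration, and the formula assumes plain multi-head attention with 2-byte (fp16) cache values. It says nothing about how much memory the quantization step itself needs.

```python
def kv_cache_bytes(context_len, num_layers, hidden_size, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * num_layers * context_len * hidden_size * bytes_per_value


# Hypothetical model dimensions, for illustration only.
NUM_LAYERS = 32
HIDDEN_SIZE = 2560

for ctx in (768, 4096):
    mib = kv_cache_bytes(ctx, NUM_LAYERS, HIDDEN_SIZE) / 2**20
    print(f"context {ctx:>4}: ~{mib:.0f} MiB")
```

With these placeholder numbers, the cache comes to about 240 MiB at 768 tokens and about 1280 MiB at 4096 tokens. That gap matters on a board with 8 GB shared between CPU and GPU.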

vincent.nguyen · May 7, 2024, 6:00am

Thanks, Dustin. After mounting the swap, the SIGKILL error is gone. However, I still can't run Live LLaVA on the Orin Nano. I disabled the GUI, but the script just hangs without printing any error.

root@ubuntu:/app# python3 -m nano_llm.agents.video_query --api=mlc     --model Efficient-Large-Model/VILA1.5-3b --max-context-len 64     --max-new-tokens 32     --video-input /dev/video0     --video-output webrtc://@:8554/output
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 65638.56it/s]
Fetching 12 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 68947.46it/s]
05:33:39 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA-2.7b/snapshots/2ed82105eefd5926cccb46af9e71b0ca77f12704 with MLC
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
05:33:42 | INFO | device=cuda(0), name=Orin, compute=8.7, max_clocks=624000, multiprocessors=8, max_thread_dims=[1024, 1024, 64], api_version=12020, driver_version=None
05:33:42 | INFO | loading VILA-2.7b from /data/models/mlc/dist/VILA-2.7b-ctx64/VILA-2.7b-q4f16_ft/VILA-2.7b-q4f16_ft-cuda.so
05:33:42 | WARNING | model library /data/models/mlc/dist/VILA-2.7b-ctx64/VILA-2.7b-q4f16_ft/VILA-2.7b-q4f16_ft-cuda.so was missing metadata
05:33:44 | INFO | loading clip vision model openai/clip-vit-large-patch14-336
<class 'nano_llm.vision.clip.CLIPImageEmbedding.__init__.<locals>.VisionEncoder'> openai/clip-vit-large-patch14-336 VisionEncoder(
  (model): CLIPVisionModelWithProjection(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(577, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            )
            (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (mlp): CLIPMLP(
              (activation_fn): QuickGELUActivation()
              (fc1): Linear(in_features=1024, out_features=4096, bias=True)
              (fc2): Linear(in_features=4096, out_features=1024, bias=True)
            )
            (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (visual_projection): Linear(in_features=1024, out_features=768, bias=False)
  )
)
┌──────────────┬───────────────────────────────────┐
│ name         │ openai/clip-vit-large-patch14-336 │
├──────────────┼───────────────────────────────────┤
│ input_shape  │ (336, 336)                        │
├──────────────┼───────────────────────────────────┤
│ output_shape │ torch.Size([1, 768])              │
└──────────────┴───────────────────────────────────┘
05:34:21 | INFO | optimizing openai/clip-vit-large-patch14-336 with TensorRT...
[05/07/2024-05:34:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 482, GPU 6295 (MiB)
[05/07/2024-05:34:22] [TRT] [V] Trying to load shared library libnvinfer_builder_resource.so.8.6.2
[05/07/2024-05:34:22] [TRT] [V] Loaded shared library libnvinfer_builder_resource.so.8.6.2
[05/07/2024-05:34:30] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1106, now: CPU 1672, GPU 7403 (MiB)
[05/07/2024-05:34:30] [TRT] [V] CUDA lazy loading is enabled.
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:279: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:319: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):

I think 8 GB of memory is not enough to run the demo.

dusty_nv · May 7, 2024, 1:47pm

Hi @vincent.nguyen, I think VILA-3B uses the larger SigLIP-384x384 vision encoder, which takes more memory to build the TensorRT engine, and I copied that built engine over from an AGX Orin, which was able to build it. While I figure out how to redistribute these engines for Nano users, can you try running it with --vision-api=hf instead? This should skip the use of TRT for the vision encoder.

vincent.nguyen · May 8, 2024, 12:22am

Hi Dustin,

Thanks for the suggestion. I tried the command with --vision-api=hf, and it works.

[11:58] Info
jetson-containers run $(autotag nano_llm)   python3 -m nano_llm.agents.video_query --api=mlc     --model Efficient-Large-Model/VILA1.5-3b     --max-context-len 256     --max-new-tokens 32     --video-input /dev/video0     --video-output webrtc://@:8554/output    --vision-api=hf

Although I got this warning

I think that on the Jetson Orin Nano, converting the SigLIP-384x384 vision encoder into a TensorRT-optimized engine is still a challenge because of the limited resources.

dusty_nv · May 8, 2024, 2:26pm

OK great @vincent.nguyen, glad you were able to get it running at least. I think for the CLIP/SigLIP TRT engines, I will need to make some repo on HuggingFace Hub to redistribute them for Nano.

system · June 4, 2024, 4:55am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
