JetPack 7.2 / Jetson Linux r39.2 on Jetson AGX Orin Developer Kit — Getting Started and feedback thread

Hi everyone,

JetPack 7.2, with Jetson Linux (L4T) r39.2, is now available for Jetson AGX Orin Developer Kit.

The big news for AGX Orin Developer Kit in this release is that it now aligns much more closely with the standard Arm server software ecosystem used by platforms such as Jetson Thor and DGX Spark. In practical terms, AGX Orin can now run mainstream Arm64 (“arm64-SBSA”) containers and binaries, rather than requiring separate Jetson-specific builds.

As a result, more standard Arm64 software can run on Jetson without Jetson-specific rebuilds or custom containers. One great example is vLLM: once AGX Orin is flashed with Jetson ISO r39.2, it can run the official vLLM container (steps are detailed at the end of this post).

To get started, please follow the updated Getting Started guide:

👉 Jetson AGX Orin Developer Kit — Getting Started
https://docs.nvidia.com/jetson/agx-orin-devkit/user-guide/latest/quick_start.html

The guide covers the details and the different cases, including BSP prerequisites, creating the USB installer, choosing target storage, and first boot setup.

A few quick reminders before you start:

  • Please follow the online guide closely, especially if your kit has an older JetPack / Jetson Linux release installed.

  • The JetPack 7.2 flow uses a Jetson ISO on a USB flash drive as the installer.

  • To use Jetson ISO to update AGX Orin Developer Kit to Jetson Linux r39.2, the installed BSP needs to be L4T r35.5 or greater. If your kit is on an older release, please update to L4T r35.5+ first by using a host PC before following the JetPack 7.2 Jetson ISO flow.

  • If prompted for a QSPI capsule update during the Jetson ISO boot flow, please follow the prompt and press `y` to allow the update to complete before continuing the installation.
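
The BSP prerequisite in the reminders above can be checked before creating the USB installer. The following is a minimal sketch, assuming the first line of `/etc/nv_tegra_release` has the usual `# R35 (release), REVISION: 5.0, ...` form; the function names and the `MIN_RELEASE` constant are illustrative and not part of any NVIDIA tool.

```python
import re
from pathlib import Path

# Minimum installed BSP for the Jetson ISO flow, taken from the reminder above.
MIN_RELEASE = (35, 5)


def parse_l4t_release(text: str) -> tuple[int, int]:
    """Extract (major, minor) from the contents of /etc/nv_tegra_release.

    Assumes a header such as "# R35 (release), REVISION: 5.0, GCID: ...".
    """
    match = re.search(r"R(\d+)\s*\(release\),\s*REVISION:\s*(\d+)", text)
    if match is None:
        raise ValueError("unrecognized nv_tegra_release format")
    return int(match.group(1)), int(match.group(2))


def meets_iso_prerequisite(text: str, minimum: tuple[int, int] = MIN_RELEASE) -> bool:
    """Return True if the installed release is at least `minimum`."""
    return parse_l4t_release(text) >= minimum


if __name__ == "__main__":
    release_file = Path("/etc/nv_tegra_release")
    if release_file.exists():
        contents = release_file.read_text()
        major, minor = parse_l4t_release(contents)
        verdict = "OK for the Jetson ISO flow" if meets_iso_prerequisite(contents) else "update to r35.5+ first"
        print(f"L4T r{major}.{minor}: {verdict}")
    else:
        print("/etc/nv_tegra_release not found (not booted into Jetson Linux?)")
```

If the file cannot be parsed, the sketch raises an error rather than guessing, so an unexpected format is visible instead of silently passing the check.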

Please share feedback in this thread

We would like to collect feedback from early users so we can improve the documentation and help others avoid common pitfalls.

If you run into an issue, please reply with:

  • Current / previous Jetson Linux version, if you can boot into Jetson Linux (run `cat /etc/nv_tegra_release`)

  • AGX Orin Developer Kit memory size: 32 GB or 64 GB

  • Current / previous Jetson UEFI firmware version, if available

  • Target storage used: eMMC or NVMe

  • Whether you used monitor-attached setup or headless setup

  • Host OS used to create the USB installer: Windows, macOS, or Linux

  • Where in the Getting Started flow you ran into trouble

  • Any relevant logs, screenshots, or exact error messages
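
To keep replies consistent, the checklist above can be turned into a small template. This is only a sketch: the `FeedbackReport` class and its field names are hypothetical, chosen to mirror the bullets, and the allowed values are the ones listed in this post.

```python
from dataclasses import dataclass

# Allowed values, copied from the checklist above.
VALID_MEMORY_GB = (32, 64)
VALID_STORAGE = ("eMMC", "NVMe")
VALID_SETUP = ("monitor-attached", "headless")
VALID_HOST_OS = ("Windows", "macOS", "Linux")


@dataclass
class FeedbackReport:
    jetson_linux_version: str
    memory_gb: int
    uefi_version: str
    target_storage: str
    setup_mode: str
    host_os: str
    failing_step: str
    logs: str = ""

    def __post_init__(self) -> None:
        # Reject values the checklist does not offer, so typos are caught early.
        checks = [
            ("memory_gb", self.memory_gb, VALID_MEMORY_GB),
            ("target_storage", self.target_storage, VALID_STORAGE),
            ("setup_mode", self.setup_mode, VALID_SETUP),
            ("host_os", self.host_os, VALID_HOST_OS),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")

    def render(self) -> str:
        """Format the report as plain text ready to paste into a reply."""
        lines = [
            f"Jetson Linux version: {self.jetson_linux_version}",
            f"Memory size: {self.memory_gb} GB",
            f"UEFI firmware version: {self.uefi_version}",
            f"Target storage: {self.target_storage}",
            f"Setup: {self.setup_mode}",
            f"Host OS for USB installer: {self.host_os}",
            f"Where it failed: {self.failing_step}",
        ]
        if self.logs:
            lines.append(f"Logs / errors:\n{self.logs}")
        return "\n".join(lines)
```

Calling `render()` on a filled-in instance produces one line per checklist item, with logs appended only when provided.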

Useful areas to report:

  • Unclear step in the Getting Started guide

  • USB installer creation or boot issue

  • BSP prerequisite confusion

  • Firmware / UEFI prerequisite confusion

  • QSPI capsule update prompt behavior

  • eMMC / NVMe storage selection confusion

  • First boot / oem-config issue

  • MAXN_SUPER power mode behavior on AGX Orin 32 GB

  • Manual flashing flow differences from older JetPack / Jetson Linux releases

  • Anything that worked, but was surprising or not obvious

Optional: validate the new software stack with upstream vLLM

After your AGX Orin Developer Kit is updated to JetPack 7.2, one useful way to validate the new SBSA-compatible environment is to run an upstream Arm64 container directly.

For example, the official upstream vLLM image:

https://hub.docker.com/layers/vllm/vllm-openai/v0.22.0-ubuntu2404/

now includes Arm64 support and sm_87, so it can run on Jetson Orin.

For AGX Orin, you can try a model such as google/gemma-3-4b-it. If you want to start smaller, google/gemma-3-1b-it should also work.

Note: Gemma models on Hugging Face require accepting the model license and using an HF token.

export HF_TOKEN=hf_xxx_your_token_here

sudo docker run --rm -it \
  --runtime=nvidia --gpus all \
  --network=host \
  --ipc=host \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.22.0-ubuntu2404 \
    --model google/gemma-3-4b-it \
    --dtype bfloat16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85
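
The first start can take a while, since the image has to download the model weights and compile before it serves requests. A minimal way to wait for readiness, sketched here with the Python standard library, is to poll the server's `/health` route. The function names, the 30-minute limit, and the 10-second interval are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = server_is_healthy,
    max_wait_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `max_wait_s` has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` with the defaults blocks until the server answers or the limit is reached; `sleep` and `clock` are parameters only so the loop can be exercised without real waiting.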

In another terminal:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [
      {
        "role": "user",
        "content": "Explain in one sentence why running an upstream Arm64 container on Jetson is useful."
      }
    ]
  }'
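
The same request can be sent from Python without extra dependencies. The sketch below mirrors the curl call above using only the standard library; the helper names `build_chat_request` and `send_chat_request` and the 300-second timeout are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    user_message: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions with a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 300.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `send_chat_request(build_chat_request("google/gemma-3-4b-it", "Hello"))` returns the response as a dictionary once the server is up.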

If you try this, please share whether it worked for you and include any changes you needed to make.

Thanks for helping us improve the JetPack 7.2 update experience for AGX Orin Developer Kit.

What is the system Python version? (EDIT: found 3.12.3 after installing JetPack 7.2)

Will I still be constrained to the preinstalled system TensorRT, or can I finally install one in a virtual environment?

The “GPU” page in jtop crashes:

```
Traceback (most recent call last):
  File "/usr/local/bin/jtop", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/jtop/__main__.py", line 160, in main
    curses.wrapper(JTOPGUI, jetson, pages, init_page=args.page,
  File "/usr/lib/python3.12/curses/__init__.py", line 94, in wrapper
    return func(stdscr, *args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/jtop/gui/jtopgui.py", line 104, in __init__
    self.run(loop, seconds)
  File "/usr/local/lib/python3.12/dist-packages/jtop/gui/jtopgui.py", line 139, in run
    self.draw(page)
  File "/usr/local/lib/python3.12/dist-packages/jtop/gui/jtopgui.py", line 155, in draw
    page.draw(self.key, self.mouse)
  File "/usr/local/lib/python3.12/dist-packages/jtop/gui/pgpu.py", line 164, in draw
    scaling_string = "Active" if gpu_status['3d_scaling'] else "Disable"
                                 ~~~~~~~~~~^^^^^^^^^^^^^^
KeyError: '3d_scaling'
```
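
The traceback shows a plain dictionary lookup failing because the `3d_scaling` key is missing from the GPU status on this release. As an illustration of the failure mode only (this is not the actual jetson_stats fix, and `scaling_label` is a hypothetical helper), a defensive lookup could look like this:

```python
def scaling_label(gpu_status: dict) -> str:
    """Describe the 3D scaling state without assuming the key is present."""
    value = gpu_status.get("3d_scaling")
    if value is None:
        # Key absent: report that the value is unavailable instead of raising KeyError.
        return "N/A"
    return "Active" if value else "Disable"
```

With an empty status dictionary this returns `"N/A"`, whereas `gpu_status['3d_scaling']` raises the `KeyError` seen above.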

jtop version: 4.3.2

You need to upgrade to the latest version of jetson-stats:
sudo pip3 install --break-system-packages git+https://github.com/rbonghi/jetson_stats.git

Hello,
I worked on the upgrade from JetPack 6.2 to 7.2. I first tried the USB key upgrade, but the BSP upgrade went into a loop and never completed. I then tried from an Ubuntu host with SDK Manager (the graphical SDKM segfaulted on my laptop, but the CLI worked), and that worked.
I installed the latest jetson-stats and pulled the vLLM Docker image. The container takes quite a long time to start (allow about 15 minutes), and I also had to reduce the GPU memory utilization from 85% to 80%. vLLM seems to detect the correct GPU, but jtop shows it running only on the CPU, even though GPU memory is filling up.

sudo docker run --rm -it --runtime=nvidia --gpus all --network=host --ipc=host -e HF_TOKEN=$HF_TOKEN -v ~/.cache/huggingface:/root/.cache/huggingface vllm/vllm-openai:v0.22.0-ubuntu2404 --model google/gemma-3-4b-it --dtype bfloat16 --max-model-len 8192 --gpu-memory-utilization 0.80
WARNING 06-05 15:29:39 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344]
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344] █ █ █▄ ▄█
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.0
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344] █▄█▀ █ █ █ █ model google/gemma-3-4b-it
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:344]
(APIServer pid=1) INFO 06-05 15:29:39 [utils.py:278] non-default args: {‘model_tag’: ‘google/gemma-3-4b-it’, ‘model’: ‘google/gemma-3-4b-it’, ‘dtype’: ‘bfloat16’, ‘max_model_len’: 8192, ‘gpu_memory_utilization’: 0.8}
(APIServer pid=1) WARNING 06-05 15:29:39 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 06-05 15:29:39 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-05 15:29:39 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-05 15:29:39 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/torch/cuda/init.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(APIServer pid=1) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(APIServer pid=1) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(APIServer pid=1) - 9.0 which supports hardware CC >=9.0,<10.0
(APIServer pid=1) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(APIServer pid=1) - 11.0 which supports hardware CC >=11.0,<12.0
(APIServer pid=1) - 12.0 which supports hardware CC >=12.0,<13.0
(APIServer pid=1) _warn_unsupported_code(d, device_cc, code_ccs)
(APIServer pid=1) INFO 06-05 15:30:05 [model.py:617] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=1) INFO 06-05 15:30:05 [model.py:1752] Using max model len 8192
(APIServer pid=1) INFO 06-05 15:30:08 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-05 15:30:08 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) WARNING 06-05 15:30:08 [cuda.py:243] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(EngineCore pid=114) INFO 06-05 15:30:44 [core.py:112] Initializing a V1 LLM engine (v0.22.0) with config: model=‘google/gemma-3-4b-it’, speculative_config=None, tokenizer=‘google/gemma-3-4b-it’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-3-4b-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::qwen_gdn_attention_core’, ‘vllm::gdn_attention_core_xpu’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::deepseek_v4_attention’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_vision_items_per_batch’: 0, ‘encoder_cudagraph_max_frames_per_batch’: None, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False, ‘fuse_rope_kvcache_cat_mla’: False, ‘fuse_act_padding’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: }, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’]), enable_flashinfer_autotune=True, moe_backend=‘auto’, linear_backend=‘auto’)
(EngineCore pid=114) /usr/local/lib/python3.12/dist-packages/torch/cuda/init.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(EngineCore pid=114) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=114) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=114) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=114) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=114) - 11.0 which supports hardware CC >=11.0,<12.0
(EngineCore pid=114) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=114) _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=114) INFO 06-05 15:30:52 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.174:33015 backend=nccl
(EngineCore pid=114) INFO 06-05 15:30:52 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=114) INFO 06-05 15:30:53 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=114) INFO 06-05 15:31:11 [gpu_model_runner.py:5037] Starting to load model google/gemma-3-4b-it…
(EngineCore pid=114) INFO 06-05 15:31:12 [interfaces.py:172] Contains out of vocabulary multimodal tokens? False
(EngineCore pid=114) INFO 06-05 15:31:12 [cuda.py:433] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=114) INFO 06-05 15:31:12 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=114) INFO 06-05 15:31:12 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=114) INFO 06-05 15:31:12 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(EngineCore pid=114) INFO 06-05 15:31:12 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: [‘FLASH_ATTN’, ‘FLASHINFER’, ‘TRITON_ATTN’, ‘FLEX_ATTENTION’].
(EngineCore pid=114) INFO 06-05 15:31:12 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=114) INFO 06-05 15:41:29 [weight_utils.py:603] Time spent downloading weights for google/gemma-3-4b-it: 614.312779 seconds
(EngineCore pid=114) INFO 06-05 15:41:29 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 8.01 GiB. Available RAM: 11.23 GiB.
(EngineCore pid=114) INFO 06-05 15:41:29 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.01s/it]
(EngineCore pid=114)
(EngineCore pid=114) INFO 06-05 15:41:31 [default_loader.py:397] Loading weights took 2.16 seconds
(EngineCore pid=114) INFO 06-05 15:41:32 [gpu_model_runner.py:5132] Model loading took 8.61 GiB memory and 619.199022 seconds
(EngineCore pid=114) INFO 06-05 15:41:33 [gpu_model_runner.py:6136] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(EngineCore pid=114) INFO 06-05 15:41:48 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/ff4c8bb952/rank_0_0/backbone for vLLM’s torch.compile
(EngineCore pid=114) INFO 06-05 15:41:48 [backends.py:1148] Dynamo bytecode transform time: 13.73 s
(EngineCore pid=114) [rank0]:W0605 15:41:51.443000 114 torch/_inductor/utils.py:1731] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=114) INFO 06-05 15:41:59 [backends.py:378] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=114) INFO 06-05 15:42:15 [backends.py:393] Compiling a graph for compile range (1, 2048) takes 25.88 s
(EngineCore pid=114) INFO 06-05 15:42:22 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/e0fa11424d6f3360e6603cccc830ae2bcf88ee4d327c398fedc4549783deee47/rank_0_0/model
(EngineCore pid=114) INFO 06-05 15:42:22 [monitor.py:53] torch.compile took 47.23 s in total
(EngineCore pid=114) INFO 06-05 15:42:23 [monitor.py:81] Initial profiling/warmup run took 0.92 s
(EngineCore pid=114) WARNING 06-05 15:42:38 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-05 15:42:38 [gpu_model_runner.py:6279] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=114) INFO 06-05 15:42:41 [gpu_model_runner.py:6365] Estimated CUDA graph memory: 0.08 GiB total
(EngineCore pid=114) INFO 06-05 15:42:42 [gpu_worker.py:466] Available KV cache memory: 9.7 GiB
(EngineCore pid=114) INFO 06-05 15:42:42 [gpu_worker.py:481] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.8000 is equivalent to --gpu-memory-utilization=0.7973 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.8027. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=114) WARNING 06-05 15:42:42 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-05 15:42:42 [kv_cache_utils.py:1733] GPU KV cache size: 155,912 tokens
(EngineCore pid=114) INFO 06-05 15:42:42 [kv_cache_utils.py:1734] Maximum concurrency for 8,192 tokens per request: 19.03x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00, 7.46it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00, 8.94it/s]
(EngineCore pid=114) INFO 06-05 15:42:56 [gpu_model_runner.py:6456] Graph capturing finished in 13 secs, took 0.56 GiB
(EngineCore pid=114) INFO 06-05 15:42:56 [gpu_worker.py:619] CUDA graph pool memory: 0.56 GiB (actual), 0.08 GiB (estimated), difference: 0.47 GiB (85.3%).
(EngineCore pid=114) INFO 06-05 15:42:57 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=114) INFO 06-05 15:42:57 [core.py:302] init engine (profile, create kv cache, warmup model) took 84.53 s (compilation: 47.23 s)
(EngineCore pid=114) INFO 06-05 15:42:58 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) INFO 06-05 15:42:58 [api_server.py:592] Supported tasks: [‘generate’]
(APIServer pid=1) WARNING 06-05 15:42:58 [model.py:1509] Default vLLM sampling parameters have been overridden by the model’s generation_config.json: {'top_k': 64, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=1) INFO 06-05 15:43:02 [hf.py:488] Detected the chat template content format to be ‘openai’. You can set --chat-template-content-format to override this.
(APIServer pid=1) INFO 06-05 15:43:43 [base.py:224] Multi-modal warmup completed in 41.235s
(APIServer pid=1) INFO 06-05 15:43:43 [base.py:224] Readonly multi-modal warmup completed in 0.049s
(APIServer pid=1) INFO 06-05 15:43:43 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 06-05 15:43:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(EngineCore pid=114) WARNING 06-05 15:45:04 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO: 127.0.0.1:49420 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-05 15:45:15 [loggers.py:271] Engine 000: Avg prompt throughput: 2.7 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-05 15:45:25 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-05 15:46:25 [loggers.py:271] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-05 15:46:35 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-05 15:46:45 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 127.0.0.1:60326 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-05 15:46:55 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-05 15:47:05 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 127.0.0.1:37026 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-05 15:54:25 [loggers.py:271] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 21.3%
(APIServer pid=1) INFO 06-05 15:54:35 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 21.3%
(APIServer pid=1) INFO 06-05 15:55:35 [loggers.py:271] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 31.7%
(APIServer pid=1) INFO 06-05 15:55:45 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 31.7%
(APIServer pid=1) INFO 06-05 15:55:55 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 31.7%
(APIServer pid=1) INFO 06-05 15:56:05 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 20.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 31.7%
(APIServer pid=1) INFO: 127.0.0.1:53270 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-05 15:56:15 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 31.7%
(APIServer pid=1) INFO 06-05 15:56:25 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 31.7%
^C(EngineCore pid=114) INFO 06-05 15:57:04 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=114) INFO 06-05 15:57:04 [core.py:1289] Shutdown complete
(APIServer pid=1) INFO 06-05 15:57:04 [launcher.py:137] Shutting down FastAPI HTTP server.
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
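
A few numbers in this log can be cross-checked with simple arithmetic. The sketch below takes its inputs from the log lines above; the helper names are illustrative, and the division is only shown to be consistent with the reported concurrency figure, not claimed to be vLLM's exact accounting.

```python
from datetime import datetime


def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """Full-length requests that fit in the KV cache, by simple division."""
    return kv_cache_tokens / max_model_len


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# "GPU KV cache size: 155,912 tokens" with --max-model-len 8192.
concurrency = max_concurrency(155_912, 8192)

# First log line (15:29:39) to "Starting vLLM server" (15:43:43).
startup_s = elapsed_seconds("15:29:39", "15:43:43")

# "Time spent downloading weights ...: 614.312779 seconds".
download_s = 614.312779

print(f"max concurrency: {concurrency:.2f}x")
print(f"startup: {startup_s:.0f} s, of which download: {download_s:.0f} s "
      f"({download_s / startup_s:.0%})")
```

On these figures, roughly three quarters of the startup time was the weight download. Because the command mounts `~/.cache/huggingface` into the container, later starts should be able to reuse the cached weights.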

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "google/gemma-3-4b-it",
"messages": [
{
"role": "user",
"content": "Explain in one sentence why running an upstream Arm64 container on Jetson is useful."
}
]
}'
{"id":"chatcmpl-80c52dda1e12c8c8","object":"chat.completion","created":1780674858,"model":"google/gemma-3-4b-it","choices":[{"index":0,"message":{"role":"assistant","content":"Running an upstream Arm64 container on a Jetson device allows you to leverage the Jetson’s hardware acceleration and processing power to efficiently execute applications originally designed for other Arm64 platforms, boosting performance and streamlining development.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":106,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.22.0-982a5dd9","usage":{"prompt_tokens":27,"total_tokens":73,"completion_tokens":46,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"prompt_text":null,"kv_transfer_params":null}

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [
      {
        "role": "user",
        "content": "what type of model are you and what can you do ? are you any good ?"
      }
    ]
  }'
{“id”:“chatcmpl-9e2089128879207d”,“object”:“chat.completion”,“created”:1780674934,“model”:“google/gemma-3-4b-it”,“choices”:[{“index”:0,“message”:{“role”:“assistant”,“content”:“Okay, let’s break down what I am and what I can do.\n\nWhat Type of Model Am I?\n\nI’m a large language model, created by the Gemma team at Google DeepMind. More specifically, I’m an open-weights model. This means my model weights are publicly available for download and use – a key difference from some other AI models. I’m based on the Gemini family of models. \n\nWhat Can I Do?\n\nI’m designed to be a versatile conversational AI. Here’s a rundown of things I can do:\n\n* Generate Text: I can produce different creative text formats, like poems, code, scripts, musical pieces, email, letters, etc. I’ll try my best to fulfill all your requirements.\n* Answer Questions: I can answer your questions in an informative way, even if they are open ended, challenging, or strange.\n* Follow Instructions: I’m pretty good at following your instructions and completing your requests thoughtfully.\n* Translate Languages: I can translate text from one language to another.\n* Summarize Text: I can condense longer pieces of text into shorter, more manageable summaries.\n* Engage in Conversation: I can chat with you about a wide range of topics.\n* Creative Writing: I can help you brainstorm ideas, develop characters, or write drafts of stories and other creative content.\n* Code Generation: I can generate code in various programming languages.\n\n\n\nAm I Any Good?\n\nThat’s a tricky question! Here’s my honest assessment:\n\n* I’m still under development. I’m constantly being refined and improved by the Gemma team.\n* I have strengths: I’m generally pretty good at understanding and responding to a wide variety of prompts. I can often generate coherent and relevant text.\n* I also have limitations: Like all large language models, I can sometimes make mistakes. 
I might:\n * Hallucinate information: I can sometimes confidently state incorrect facts. Always double-check important information I provide.\n * Be biased: My training data reflects biases that exist in the real world, and I can inadvertently perpetuate those biases in my responses.\n * Struggle with complex reasoning: I’m not perfect at complex logic or nuanced understanding.\n * Lack real-world experience: I’ve only learned from text data, so I don’t have firsthand experience of the world.\n\nImportant Note: I don’t have access to real-time information or external tools like Google Search. My knowledge cutoff is a point in the past.\n\nHow to Get the Best Results From Me:\n\n* Be specific: The more clearly you state your request, the better I can understand and respond.\n* Provide context: Give me enough background information to help me understand what you’re looking for.\n* Iterate: If my first response isn’t quite right, try rephrasing your prompt or providing more guidance.\n\nI’m here to help in any way I can! Do you have a specific task you’d like me to try, or would you like me to elaborate on a particular area?”,“refusal”:null,“annotations”:null,“audio”:null,“function_call”:null,“tool_calls”:,“reasoning”:null},“logprobs”:null,“finish_reason”:“stop”,“stop_reason”:106,“token_ids”:null,“routed_experts”:null}],“service_tier”:null,“system_fingerprint”:“vllm-0.22.0-982a5dd9”,“usage”:{“prompt_tokens”:26,“total_tokens”:728,“completion_tokens”:702,“prompt_tokens_details”:null},“prompt_logprobs”:null,“prompt_token_ids”:null,“prompt_text”:null,“kv_transfer_params”:null}
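
The raw JSON responses above are hard to read in a terminal. A small sketch for pulling out the useful fields follows; `summarize_completion` is a hypothetical helper, and the field names are the ones visible in the responses above.

```python
import json


def summarize_completion(response_text: str) -> dict:
    """Extract the reply text, finish reason, and token counts from a response body."""
    body = json.loads(response_text)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }
```

Saving the curl output to a file and passing its contents to `summarize_completion` gives the answer text and the token counts without the surrounding metadata.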

There is no Wi-Fi driver for the following card. I tried installing backport-iwlwifi-dkms with no luck, which leaves my AGX Orin 64GB without network access. Please advise.

0001:01:00.0 Network controller: Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak] (rev 1a)

There is a PR to jetson_stats that fixes the 2gpu page of jtop.
You might want to wait a few days, then run the following to get the updated jtop. If you need jtop now, you can run this and then run it again in a week or so to get the 2gpu fix.

sudo -v
curl -LsSf https://raw.githubusercontent.com/rbonghi/jetson_stats/master/scripts/upgrade-jtop.sh | bash

I couldn’t install with SDK Manager (CLI via Docker).

It takes forever and never finishes installing. The Jetson Linux image step runs to 99% but then stays there, burning 100% CPU for hours (I selected the “Yes continue” option multiple times, hoping it would finish, but it doesn’t). I then gave up on that step, and now it is stuck in Host components / CUDA installing (48%), still using 200% CPU with no progress in 2 hours.

If I list the processes on my machine, I see this:

root        5801  0.0  0.0 1268260 14372 ?       Sl   11:24   0:01  \_ /usr/bin/containerd-shim-runc-v2 -namespace moby -id 4f0bb9fe891c95d23f7c1c28c4ec7cae17caac69ca902c954ee5dc914a09980f -address /run/containerd/
1001        5825  105  9.6 14802096 3097056 pts/0 R<sl+ 11:24 121:33      \_ sdkmanager --cli%
1001        5895  0.0  0.1 1012476 50052 ?       Ssl  11:24   0:00          \_ /opt/nvidia/sdkmanager/sdkmanager /opt/nvidia/sdkmanager/resources/app/output/dist/service/spawn-worker.js
1001        5896  1.5  0.4 1400268 130312 pts/0  Sl+  11:24   1:47          \_ /opt/nvidia/sdkmanager/sdkmanager /opt/nvidia/sdkmanager/resources/app/output/dist/service/downloadService.bundle.js
1001        6474  0.0  0.0   7940  4920 pts/1    S<s+ 11:28   0:00          \_ bash --rcfile /opt/nvidia/sdkmanager/resources/app/scripts/linux/sdkmanager.bashrc --noprofile
1001        6500  0.0  0.0   7984  4852 pts/2    S<s+ 11:28   0:00          \_ bash --rcfile /opt/nvidia/sdkmanager/resources/app/scripts/linux/sdkmanager.bashrc --noprofile
root        7617  0.0  0.0      0     0 ?        Z<s  11:36   0:00          \_ [sudo] <defunct>
root        7620  0.0  0.0      0     0 ?        Z<s  11:36   0:00          \_ [sudo] <defunct>
root        7685  0.0  0.0      0     0 ?        Z<s  11:36   0:00          \_ [sudo] <defunct>
root        8569  0.0  0.0      0     0 ?        Z<s  11:42   0:00          \_ [sudo] <defunct>
root        8572  0.0  0.0      0     0 ?        Z<s  11:42   0:00          \_ [sudo] <defunct>
root        8575  0.0  0.0      0     0 ?        Z<s  11:42   0:00          \_ [sudo] <defunct>
1001        8828  0.0  0.0      0     0 pts/2    Z<   11:44   0:00          \_ [NV_L4T_FILE_SYS] <defunct>
root        8830  0.0  0.0      0     0 ?        Z<s  11:44   0:00          \_ [sudo] <defunct>
root       10369  0.0  0.0      0     0 ?        Z<s  11:45   0:00          \_ [sudo] <defunct>
root       10372  0.0  0.0      0     0 ?        Z<s  11:45   0:00          \_ [sudo] <defunct>
root       10418  0.0  0.0      0     0 ?        Z<s  11:45   0:00          \_ [sudo] <defunct>
root       10421  0.0  0.0      0     0 ?        Z<s  11:45   0:00          \_ [sudo] <defunct>
root       10428  0.0  0.0      0     0 ?        Z<s  11:45   0:00          \_ [sudo] <defunct>
root       10483  0.0  0.0      0     0 ?        Z<s  11:46   0:00          \_ [sudo] <defunct>
1001       11256  0.0  0.0      0     0 pts/2    Z<   11:50   0:00          \_ [cat] <defunct>
1001       16068  0.0  0.0   8044  4924 ?        S<s  12:37   0:00          \_ bash --rcfile /opt/nvidia/sdkmanager/resources/app/scripts/linux/sdkmanager.bashrc --noprofile
1001       16073  0.0  0.1 1660952 46484 ?       S<l+ 12:37   0:01          |   \_ /opt/nvidia/sdkmanager/resources/app/output/installUtils/adapter -a=install -c=eyJjb21wSnNvbk9iamVjdCI6eyJuYW1lIjoiQ1VEQSBUb29sa2l0
root       16192  0.0  0.0  16744  7152 ?        S<+  12:37   0:00          |       \_ sudo -E apt-get -y --allow-downgrades --allow-downgrades install cuda-toolkit-13-2=13.2*
root       16274  0.0  0.0  16744  2608 ?        S<s+ 12:37   0:00          |           \_ sudo -E apt-get -y --allow-downgrades --allow-downgrades install cuda-toolkit-13-2=13.2*
root       16275  0.0  0.3 119456 109556 ?       S<   12:37   0:01          |               \_ apt-get -y --allow-downgrades --allow-downgrades install cuda-toolkit-13-2=13.2*
root       16097  0.0  0.0      0     0 ?        Z<s  12:37   0:00          \_ [sudo] <defunct>
root       16124  0.0  0.0      0     0 ?        Z<   12:37   0:00          \_ [dpkg-preconfigu] <defunct>
root       16183  0.0  0.0      0     0 ?        Z<s  12:37   0:00          \_ [sudo] <defunct>
root       16186  0.0  0.0      0     0 ?        Z<s  12:37   0:00          \_ [sudo] <defunct>
root       16194  0.0  0.0      0     0 ?        Z<s  12:37   0:00          \_ [sudo] <defunct>

So it’s very likely something failed somehow in the process but nothing was displayed in the interface.
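
To make a listing like the one above easier to read, the zombie entries can be tallied by command name. This is only a minimal sketch, assuming `ps auxf`-style output in which zombies end with `[name] <defunct>`; `summarize_defunct` is a made-up helper name, not part of SDK Manager.

```shell
# Tally zombie (defunct) processes by command name.
# Reads ps-style text on stdin and prints "name count", one per line.
summarize_defunct() {
    awk '$NF == "<defunct>" {
             f = $(NF - 1)                       # e.g. "[sudo]"
             name = substr(f, 2, length(f) - 2)  # strip the square brackets
             count[name]++
         }
         END { for (n in count) printf "%s %d\n", n, count[n] }' | sort
}
```

Running `ps auxf | summarize_defunct` on the host should then show at a glance which helper commands were left unreaped. A long run of `sudo` entries, as in the listing above, suggests the installer spawned them and never collected their exit status.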

I’ve rerun the installer, this time monitoring the terminal logs, and got errors like this:

error: Err:37 https://urm.nvidia.com/artifactory/ubuntu-mit-mirror-remote/ubuntu noble/main amd64        libfdt1 amd64 1.7.0-2build1                                                                              info:   Something wicked happened resolving 'urm.nvidia.com:https' (-5 - No address associated with hostname)

Indeed, I can’t ping urm.nvidia.com, the domain doesn’t resolve.
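
To see where the unresolvable host comes from, it may help to check which apt source files mention it. A rough sketch, assuming the standard apt locations and that it is run in the same environment where apt runs (here, inside the SDK Manager container); `resolves` and `apt_sources_mentioning` are hypothetical helper names.

```shell
# Report (via exit status) whether a hostname resolves through the system resolver.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# List apt source files that mention a given host.
# Extra arguments override the default search paths.
apt_sources_mentioning() {
    host="$1"
    shift
    [ "$#" -gt 0 ] || set -- /etc/apt/sources.list /etc/apt/sources.list.d
    grep -rlsF -- "$host" "$@"
}
```

For example, `resolves urm.nvidia.com || apt_sources_mentioning urm.nvidia.com` prints the files that reference the host only when the name fails to resolve.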

I now have a working Intel WiFi 7(802.11be) BE20* on my Orin.

To build and install the Intel WiFi kernel drivers on your Orin, or Thor:

Save the attached defconfig.txt, jetpack72-download-extract.sh.txt, and nvbuild-kernel-dtb-and-Modules.sh.txt into ~/jp7.2 (create the directory first with the mkdir below), dropping the .txt extension.

nvbuild-kernel-dtb-and-Modules.sh.txt (3.1 KB)
jetpack72-download-extract.sh.txt (5.5 KB)
defconfig.txt (42.3 KB)

mkdir -p ~/jp7.2
cd ~/jp7.2

# The attached defconfig has the requisite Intel WiFi kernel module settings.
# The grep should print the three lines shown in the comments:
grep -i iwl defconfig
# CONFIG_IWLWIFI=m
# CONFIG_IWLDVM=m
# CONFIG_IWLMVM=m

chmod +x *.sh

# Run to download and extract JetPack 7.2.

./jetpack72-download-extract.sh

# Then
cp ./defconfig Linux_for_Tegra/source/kernel/kernel-noble/arch/arm64/configs/ 
# Then run 
./nvbuild-kernel-dtb-and-Modules.sh

# When script asks "Run make menuconfig to review/edit kernel config before building? [y/N]"  type N 
# unless you need to add something in addition to Intel Wifi drivers.

# When script asks "Build modules only (m) or full kernel, modules, dtbs (A) [m/A]"   type A

# When script completes and kernel, modules and dtbs are built, 
# The script will print:

[[ "$(basename "$PWD")" == source && "$(basename "$(dirname "$PWD")")" == Linux_for_Tegra ]] || cd Linux_for_Tegra/source
source ../../jetpack-env.sh
./nvbuild.sh -o "$PWD/kernel_out" -i

Copy all 3 lines and paste them back into the existing terminal.
That will install the kernel on your Orin.
# See the Note below if you must run this on a host.


Now we must install the compiled Intel Wifi modules and firmware:

KVER="6.8.12-1021-tegra"
MOD_DIR="/lib/modules/$KVER/kernel/drivers/net/wireless/intel/iwlwifi"

sudo mkdir -p $MOD_DIR
sudo mkdir -p $MOD_DIR/dvm
sudo mkdir -p $MOD_DIR/mvm

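# Use an absolute path: after pasting the three lines above, the shell is
# already inside Linux_for_Tegra/source, so the relative path would not resolve.
cd ~/jp7.2/Linux_for_Tegra/source/kernel_out/kernel/kernel-noble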

sudo cp -p drivers/net/wireless/intel/iwlwifi/iwlwifi.ko $MOD_DIR/
sudo cp -p drivers/net/wireless/intel/iwlwifi/dvm/iwldvm.ko $MOD_DIR/dvm/
sudo cp -p drivers/net/wireless/intel/iwlwifi/mvm/iwlmvm.ko $MOD_DIR/mvm/

# Sign the Modules
sudo ./scripts/sign-file sha256 ./certs/signing_key.pem ./certs/signing_key.x509 $MOD_DIR/iwlwifi.ko
sudo ./scripts/sign-file sha256 ./certs/signing_key.pem ./certs/signing_key.x509 $MOD_DIR/mvm/iwlmvm.ko
sudo ./scripts/sign-file sha256 ./certs/signing_key.pem ./certs/signing_key.x509 $MOD_DIR/dvm/iwldvm.ko

# Persist Module Load
echo "iwlwifi" | sudo tee /etc/modules-load.d/intelwifi.conf

# Update Module Dependencies
sudo depmod -a $KVER

# Fetch, Extract, and Install Firmware
sudo apt install -y zstd

cd /tmp
if [ ! -d "linux-firmware" ]; then
    git clone --depth=1 https://gitlab.com/kernel-firmware/linux-firmware.git
fi

# Copy / extract Intel wifi firmware
sudo cp -v linux-firmware/iwlwifi-gl-c0-fm-c0* /lib/firmware/
sudo zstd -d "/lib/firmware/iwlwifi-gl-c0-fm-c0-86.ucode.zst"

# Set permissions
sudo chmod 0644 /lib/firmware/iwlwifi-gl-c0-fm-c0*

# Update Initramfs
sudo nv-update-initrd


Note: I tested this today running it entirely on the Orin. If you can’t build on the Orin, the scripts also build the kernel, modules, and dtbs on x86_64/amd64 hosts, but you’ll need to work out how to get the Intel WiFi drivers onto the Orin, either by flashing with l4t_initrd_flash.sh or by copying them over with scp/rsync and installing them there.

export INSTALL_MOD_PATH="$PWD/Linux_for_Tegra/rootfs/"
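
Before signing the modules and rebooting, it may be worth confirming that the copy step put all three modules where the steps above expect them. A small sketch under that assumption; `check_iwlwifi_modules` is a hypothetical helper, and the file names are the three `.ko` files copied above.

```shell
# Check that the three iwlwifi modules exist under a module directory.
# Prints each missing file and returns non-zero if any is absent.
check_iwlwifi_modules() {
    mod_dir="$1"
    missing=0
    for f in iwlwifi.ko dvm/iwldvm.ko mvm/iwlmvm.ko; do
        if [ ! -f "$mod_dir/$f" ]; then
            echo "missing: $mod_dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

With the variables from the steps above still set, `check_iwlwifi_modules "$MOD_DIR"` prints nothing and returns success when all three files are in place.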

I tried the above. The first part (docker pull, etc.) worked OK, as follows:

Digest: sha256:e9fe11857ce91c5c28cefcb6f693076adaf7b6f72565f4b1461d0de2a5452216
Status: Downloaded newer image for vllm/vllm-openai:v0.22.0-ubuntu2404
WARNING 06-07 15:15:14 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344]
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] █ █ █▄ ▄█
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.0
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] █▄█▀ █ █ █ █ model google/gemma-3-4b-it
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344]
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:278] non-default args: {‘model_tag’: ‘google/gemma-3-4b-it’, ‘model’: ‘google/gemma-3-4b-it’, ‘dtype’: ‘bfloat16’, ‘max_model_len’: 8192, ‘gpu_memory_utilization’: 0.85}
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(APIServer pid=1) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(APIServer pid=1) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(APIServer pid=1) - 9.0 which supports hardware CC >=9.0,<10.0
(APIServer pid=1) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(APIServer pid=1) - 11.0 which supports hardware CC >=11.0,<12.0
(APIServer pid=1) - 12.0 which supports hardware CC >=12.0,<13.0
(APIServer pid=1) _warn_unsupported_code(d, device_cc, code_ccs)
(APIServer pid=1) INFO 06-07 15:15:38 [model.py:617] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=1) INFO 06-07 15:15:38 [model.py:1752] Using max model len 8192
(APIServer pid=1) INFO 06-07 15:15:42 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-07 15:15:42 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) WARNING 06-07 15:15:42 [cuda.py:243] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(EngineCore pid=114) INFO 06-07 15:16:20 [core.py:112] Initializing a V1 LLM engine (v0.22.0) with config: model=‘google/gemma-3-4b-it’, speculative_config=None, tokenizer=‘google/gemma-3-4b-it’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-3-4b-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::qwen_gdn_attention_core’, ‘vllm::gdn_attention_core_xpu’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::deepseek_v4_attention’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_vision_items_per_batch’: 0, ‘encoder_cudagraph_max_frames_per_batch’: None, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False, ‘fuse_rope_kvcache_cat_mla’: False, ‘fuse_act_padding’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: }, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’]), enable_flashinfer_autotune=True, moe_backend=‘auto’, linear_backend=‘auto’)
(EngineCore pid=114) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(EngineCore pid=114) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=114) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=114) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=114) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=114) - 11.0 which supports hardware CC >=11.0,<12.0
(EngineCore pid=114) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=114) _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=114) INFO 06-07 15:16:28 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.145:48341 backend=nccl
(EngineCore pid=114) INFO 06-07 15:16:28 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=114) INFO 06-07 15:16:29 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=114) INFO 06-07 15:16:49 [gpu_model_runner.py:5037] Starting to load model google/gemma-3-4b-it…
(EngineCore pid=114) INFO 06-07 15:16:49 [interfaces.py:172] Contains out of vocabulary multimodal tokens? False
(EngineCore pid=114) INFO 06-07 15:16:49 [cuda.py:433] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=114) INFO 06-07 15:16:49 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=114) INFO 06-07 15:16:50 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=114) INFO 06-07 15:16:50 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(EngineCore pid=114) INFO 06-07 15:16:50 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: [‘FLASH_ATTN’, ‘FLASHINFER’, ‘TRITON_ATTN’, ‘FLEX_ATTENTION’].
(EngineCore pid=114) INFO 06-07 15:16:50 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=114) INFO 06-07 15:16:53 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 8.01 GiB. Available RAM: 39.65 GiB.
(EngineCore pid=114) INFO 06-07 15:16:53 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.37s/it]
(EngineCore pid=114)
(EngineCore pid=114) INFO 06-07 15:16:58 [default_loader.py:397] Loading weights took 4.85 seconds
(EngineCore pid=114) INFO 06-07 15:16:59 [gpu_model_runner.py:5132] Model loading took 8.61 GiB memory and 8.445627 seconds
(EngineCore pid=114) INFO 06-07 15:16:59 [gpu_model_runner.py:6136] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(EngineCore pid=114) INFO 06-07 15:17:15 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/ff4c8bb952/rank_0_0/backbone for vLLM’s torch.compile
(EngineCore pid=114) INFO 06-07 15:17:15 [backends.py:1148] Dynamo bytecode transform time: 13.61 s
(EngineCore pid=114) [rank0]:W0607 15:17:18.104000 114 torch/_inductor/utils.py:1731] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=114) INFO 06-07 15:17:25 [backends.py:378] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=114) INFO 06-07 15:17:40 [backends.py:393] Compiling a graph for compile range (1, 2048) takes 24.70 s
(EngineCore pid=114) INFO 06-07 15:17:47 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/e0fa11424d6f3360e6603cccc830ae2bcf88ee4d327c398fedc4549783deee47/rank_0_0/model
(EngineCore pid=114) INFO 06-07 15:17:47 [monitor.py:53] torch.compile took 45.99 s in total
(EngineCore pid=114) INFO 06-07 15:17:48 [monitor.py:81] Initial profiling/warmup run took 0.76 s
(EngineCore pid=114) WARNING 06-07 15:18:03 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-07 15:18:03 [gpu_model_runner.py:6279] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=114) INFO 06-07 15:18:07 [gpu_model_runner.py:6365] Estimated CUDA graph memory: 0.11 GiB total
(EngineCore pid=114) INFO 06-07 15:18:08 [gpu_worker.py:466] Available KV cache memory: 37.5 GiB
(EngineCore pid=114) INFO 06-07 15:18:08 [gpu_worker.py:481] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.8500 is equivalent to --gpu-memory-utilization=0.8483 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.8517. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=114) WARNING 06-07 15:18:08 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-07 15:18:08 [kv_cache_utils.py:1733] GPU KV cache size: 602,720 tokens
(EngineCore pid=114) INFO 06-07 15:18:08 [kv_cache_utils.py:1734] Maximum concurrency for 8,192 tokens per request: 73.57x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00, 7.30it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00, 8.72it/s]
(EngineCore pid=114) INFO 06-07 15:18:29 [gpu_model_runner.py:6456] Graph capturing finished in 13 secs, took 0.28 GiB
(EngineCore pid=114) INFO 06-07 15:18:29 [gpu_worker.py:619] CUDA graph pool memory: 0.28 GiB (actual), 0.11 GiB (estimated), difference: 0.18 GiB (62.1%).
(EngineCore pid=114) INFO 06-07 15:18:29 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=114) INFO 06-07 15:18:29 [core.py:302] init engine (profile, create kv cache, warmup model) took 90.28 s (compilation: 45.99 s)
(EngineCore pid=114) INFO 06-07 15:18:30 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) INFO 06-07 15:18:30 [api_server.py:592] Supported tasks: [‘generate’]
(APIServer pid=1) WARNING 06-07 15:18:31 [model.py:1509] Default vLLM sampling parameters have been overridden by the model’s generation_config.json: {'top_k': 64, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=1) INFO 06-07 15:18:36 [hf.py:488] Detected the chat template content format to be ‘openai’. You can set --chat-template-content-format to override this.
(APIServer pid=1) INFO 06-07 15:19:19 [base.py:224] Multi-modal warmup completed in 42.960s
(APIServer pid=1) INFO 06-07 15:19:19 [base.py:224] Readonly multi-modal warmup completed in 0.045s
(APIServer pid=1) INFO 06-07 15:19:20 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 192.168.1.145:48502 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 192.168.1.145:48502 - “GET /favicon.ico HTTP/1.1” 404 Not Found
(EngineCore pid=114) WARNING 06-07 15:20:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO: 127.0.0.1:40644 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:20:31 [loggers.py:271] Engine 000: Avg prompt throughput: 2.7 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-07 15:20:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 127.0.0.1:38658 - “GET /v1/chat/completions HTTP/1.1” 405 Method Not Allowed
(APIServer pid=1) INFO: 127.0.0.1:38658 - “GET /favicon.ico HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 127.0.0.1:55038 - “GET /v1/chat/completions HTTP/1.1” 405 Method Not Allowed
(APIServer pid=1) INFO: 127.0.0.1:35426 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO 06-07 15:23:51 [loggers.py:271] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 2.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO: 127.0.0.1:55026 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:24:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO 06-07 15:24:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO: 192.168.1.145:49882 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) WARNING: Invalid HTTP request received.
(APIServer pid=1) WARNING: Invalid HTTP request received.

(APIServer pid=1) INFO: 127.0.0.1:56316 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:30:01 [loggers.py:271] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 4.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 39.5%
(APIServer pid=1) INFO 06-07 15:30:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 39.5%
(APIServer pid=1) INFO: 127.0.0.1:41894 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 127.0.0.1:41894 - “GET /favicon.ico HTTP/1.1” 404 Not Found
^C(EngineCore pid=114) INFO 06-07 15:37:10 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=114) INFO 06-07 15:37:10 [core.py:1289] Shutdown complete
(APIServer pid=1) INFO 06-07 15:37:10 [launcher.py:137] Shutting down FastAPI HTTP server.
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.

The error occurred because I had another server live; my fault.
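
As a sanity check on the log above, the "Maximum concurrency" figure appears to be the KV cache size in tokens divided by the per-request context length. That is an assumption about how vLLM derives the number, but the values in the log agree with it. A quick sketch (`kv_concurrency` is a made-up helper name):

```shell
# Ratio of KV cache capacity (tokens) to per-request context length (tokens),
# rounded to two decimals. Assumed to match vLLM's "Maximum concurrency" line.
kv_concurrency() {
    awk -v cache_tokens="$1" -v max_len="$2" \
        'BEGIN { printf "%.2f\n", cache_tokens / max_len }'
}

# Values taken from the log: 602,720 cache tokens, 8,192 tokens per request.
kv_concurrency 602720 8192   # prints 73.57
```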

The result: POST /v1/chat/completions HTTP/1.1 200 OK

This is the second part, which does not seem correct:

~$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-4b-it",
    "messages": [
      {
        "role": "user",
        "content": "Explain in one sentence why running an upstream Arm64 container on Jetson is useful."
      }
    ]
  }'
{“id”:“chatcmpl-980611f4033a2e6f”,“object”:“chat.completion”,“created”:1780846197,“model”:“google/gemma-3-4b-it”,“choices”:[{“index”:0,“message”:{“role”:“assistant”,"paul@orpaul@opauppppaupaupppapapaupppppapaul@paulpapaulpapapppappaul@orin:~$

ANY THOUGHTS WOULD BE APPRECIATED.
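
One way to rule out the terminal as the cause of the garbled text is to write the response to a file and extract the reply from there. The server log shows 200 OK, so the response body itself may well be intact. A minimal sketch, assuming `python3` is on the PATH; `extract_reply` and `request.json` are hypothetical names used only for illustration.

```shell
# Pull the assistant's reply text out of a chat-completions JSON response.
# Assumes python3 is available; reads the JSON on stdin.
extract_reply() {
    python3 -c '
import json, sys
response = json.load(sys.stdin)
print(response["choices"][0]["message"]["content"])
'
}

# Usage (request.json is a hypothetical file holding the JSON body from the post):
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @request.json -o response.json
#   extract_reply < response.json
```

If `response.json` parses cleanly, the garbling is happening in the terminal rather than in the server's output.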

The ISO method did not work in my case. Booting from the USB drive, the updater states that it is preparing the capsule. When the system reboots, it enters a boot loop, stating that it will restart the system in 2 seconds. On the fifth attempt, the system halts, saying it is non-recoverable. There does not appear to be any attempt to actually update the firmware as the first part of the update process. This is on an AGX Orin 32GB, but it’s an early one.

I had a similar problem on my 32GB Orin and ended up using:

sudo ./tools/kernel_flash/l4t_initrd_flash.sh \
  --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml \
  --showlogs \
  jetson-agx-orin-devkit external

# Then 
sudo apt update
sudo apt install nvidia-jetpack

Thanks a lot. I will try later when the release is more stable; it took me a whole day to flash…