Tried the above the first part docker pull etc worked ok as follows.
Digest: sha256:e9fe11857ce91c5c28cefcb6f693076adaf7b6f72565f4b1461d0de2a5452216
Status: Downloaded newer image for vllm/vllm-openai:v0.22.0-ubuntu2404
WARNING 06-07 15:15:14 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344]
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] █ █ █▄ ▄█
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.0
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] █▄█▀ █ █ █ █ model google/gemma-3-4b-it
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:344]
(APIServer pid=1) INFO 06-07 15:15:14 [utils.py:278] non-default args: {‘model_tag’: ‘google/gemma-3-4b-it’, ‘model’: ‘google/gemma-3-4b-it’, ‘dtype’: ‘bfloat16’, ‘max_model_len’: 8192, ‘gpu_memory_utilization’: 0.85}
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-07 15:15:14 [envs.py:2057] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/torch/cuda/init.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(APIServer pid=1) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(APIServer pid=1) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(APIServer pid=1) - 9.0 which supports hardware CC >=9.0,<10.0
(APIServer pid=1) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(APIServer pid=1) - 11.0 which supports hardware CC >=11.0,<12.0
(APIServer pid=1) - 12.0 which supports hardware CC >=12.0,<13.0
(APIServer pid=1) _warn_unsupported_code(d, device_cc, code_ccs)
(APIServer pid=1) INFO 06-07 15:15:38 [model.py:617] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=1) INFO 06-07 15:15:38 [model.py:1752] Using max model len 8192
(APIServer pid=1) INFO 06-07 15:15:42 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-07 15:15:42 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) WARNING 06-07 15:15:42 [cuda.py:243] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(EngineCore pid=114) INFO 06-07 15:16:20 [core.py:112] Initializing a V1 LLM engine (v0.22.0) with config: model=‘google/gemma-3-4b-it’, speculative_config=None, tokenizer=‘google/gemma-3-4b-it’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-3-4b-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::qwen_gdn_attention_core’, ‘vllm::gdn_attention_core_xpu’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, ‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::deepseek_v4_attention’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_vision_items_per_batch’: 0, ‘encoder_cudagraph_max_frames_per_batch’: None, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False, ‘fuse_rope_kvcache_cat_mla’: False, ‘fuse_act_padding’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: }, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’]), enable_flashinfer_autotune=True, moe_backend=‘auto’, linear_backend=‘auto’)
(EngineCore pid=114) /usr/local/lib/python3.12/dist-packages/torch/cuda/init.py:371: UserWarning: Found GPU0 Orin which is of compute capability (CC) 8.7.
(EngineCore pid=114) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=114) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=114) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=114) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=114) - 11.0 which supports hardware CC >=11.0,<12.0
(EngineCore pid=114) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=114) _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=114) INFO 06-07 15:16:28 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.145:48341 backend=nccl
(EngineCore pid=114) INFO 06-07 15:16:28 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=114) INFO 06-07 15:16:29 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=114) INFO 06-07 15:16:49 [gpu_model_runner.py:5037] Starting to load model google/gemma-3-4b-it…
(EngineCore pid=114) INFO 06-07 15:16:49 [interfaces.py:172] Contains out of vocabulary multimodal tokens? False
(EngineCore pid=114) INFO 06-07 15:16:49 [cuda.py:433] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=114) INFO 06-07 15:16:49 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=114) INFO 06-07 15:16:50 [vllm.py:977] Asynchronous scheduling is enabled.
(EngineCore pid=114) INFO 06-07 15:16:50 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(EngineCore pid=114) INFO 06-07 15:16:50 [cuda.py:378] Using FLASH_ATTN attention backend out of potential backends: [‘FLASH_ATTN’, ‘FLASHINFER’, ‘TRITON_ATTN’, ‘FLEX_ATTENTION’].
(EngineCore pid=114) INFO 06-07 15:16:50 [flash_attn.py:636] Using FlashAttention version 2
(EngineCore pid=114) INFO 06-07 15:16:53 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 8.01 GiB. Available RAM: 39.65 GiB.
(EngineCore pid=114) INFO 06-07 15:16:53 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.64s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.37s/it]
(EngineCore pid=114)
(EngineCore pid=114) INFO 06-07 15:16:58 [default_loader.py:397] Loading weights took 4.85 seconds
(EngineCore pid=114) INFO 06-07 15:16:59 [gpu_model_runner.py:5132] Model loading took 8.61 GiB memory and 8.445627 seconds
(EngineCore pid=114) INFO 06-07 15:16:59 [gpu_model_runner.py:6136] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(EngineCore pid=114) INFO 06-07 15:17:15 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/ff4c8bb952/rank_0_0/backbone for vLLM’s torch.compile
(EngineCore pid=114) INFO 06-07 15:17:15 [backends.py:1148] Dynamo bytecode transform time: 13.61 s
(EngineCore pid=114) [rank0]:W0607 15:17:18.104000 114 torch/_inductor/utils.py:1731] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=114) INFO 06-07 15:17:25 [backends.py:378] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=114) INFO 06-07 15:17:40 [backends.py:393] Compiling a graph for compile range (1, 2048) takes 24.70 s
(EngineCore pid=114) INFO 06-07 15:17:47 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/e0fa11424d6f3360e6603cccc830ae2bcf88ee4d327c398fedc4549783deee47/rank_0_0/model
(EngineCore pid=114) INFO 06-07 15:17:47 [monitor.py:53] torch.compile took 45.99 s in total
(EngineCore pid=114) INFO 06-07 15:17:48 [monitor.py:81] Initial profiling/warmup run took 0.76 s
(EngineCore pid=114) WARNING 06-07 15:18:03 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-07 15:18:03 [gpu_model_runner.py:6279] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=114) INFO 06-07 15:18:07 [gpu_model_runner.py:6365] Estimated CUDA graph memory: 0.11 GiB total
(EngineCore pid=114) INFO 06-07 15:18:08 [gpu_worker.py:466] Available KV cache memory: 37.5 GiB
(EngineCore pid=114) INFO 06-07 15:18:08 [gpu_worker.py:481] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.8500 is equivalent to --gpu-memory-utilization=0.8483 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.8517. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=114) WARNING 06-07 15:18:08 [kv_cache_utils.py:1157] Add 1 padding layers, may waste at most 3.45% KV cache memory
(EngineCore pid=114) INFO 06-07 15:18:08 [kv_cache_utils.py:1733] GPU KV cache size: 602,720 tokens
(EngineCore pid=114) INFO 06-07 15:18:08 [kv_cache_utils.py:1734] Maximum concurrency for 8,192 tokens per request: 73.57x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00, 7.30it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00, 8.72it/s]
(EngineCore pid=114) INFO 06-07 15:18:29 [gpu_model_runner.py:6456] Graph capturing finished in 13 secs, took 0.28 GiB
(EngineCore pid=114) INFO 06-07 15:18:29 [gpu_worker.py:619] CUDA graph pool memory: 0.28 GiB (actual), 0.11 GiB (estimated), difference: 0.18 GiB (62.1%).
(EngineCore pid=114) INFO 06-07 15:18:29 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=114) INFO 06-07 15:18:29 [core.py:302] init engine (profile, create kv cache, warmup model) took 90.28 s (compilation: 45.99 s)
(EngineCore pid=114) INFO 06-07 15:18:30 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=[‘native’], fused_add_rms_norm=[‘native’])
(APIServer pid=1) INFO 06-07 15:18:30 [api_server.py:592] Supported tasks: [‘generate’]
(APIServer pid=1) WARNING 06-07 15:18:31 [model.py:1509] Default vLLM sampling parameters have been overridden by the model’s generation_config.json: {'top_k': 64, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=1) INFO 06-07 15:18:36 [hf.py:488] Detected the chat template content format to be ‘openai’. You can set --chat-template-content-format to override this.
(APIServer pid=1) INFO 06-07 15:19:19 [base.py:224] Multi-modal warmup completed in 42.960s
(APIServer pid=1) INFO 06-07 15:19:19 [base.py:224] Readonly multi-modal warmup completed in 0.045s
(APIServer pid=1) INFO 06-07 15:19:20 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 06-07 15:19:20 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 192.168.1.145:48502 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 192.168.1.145:48502 - “GET /favicon.ico HTTP/1.1” 404 Not Found
(EngineCore pid=114) WARNING 06-07 15:20:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO: 127.0.0.1:40644 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:20:31 [loggers.py:271] Engine 000: Avg prompt throughput: 2.7 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-07 15:20:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 127.0.0.1:38658 - “GET /v1/chat/completions HTTP/1.1” 405 Method Not Allowed
(APIServer pid=1) INFO: 127.0.0.1:38658 - “GET /favicon.ico HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 127.0.0.1:55038 - “GET /v1/chat/completions HTTP/1.1” 405 Method Not Allowed
(APIServer pid=1) INFO: 127.0.0.1:35426 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO 06-07 15:23:51 [loggers.py:271] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 2.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO: 127.0.0.1:55026 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:24:01 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO 06-07 15:24:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 29.6%
(APIServer pid=1) INFO: 192.168.1.145:49882 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) WARNING: Invalid HTTP request received.
(APIServer pid=1) WARNING: Invalid HTTP request received.
(APIServer pid=1) INFO: 127.0.0.1:56316 - “POST /v1/chat/completions HTTP/1.1” 200 OK
(APIServer pid=1) INFO 06-07 15:30:01 [loggers.py:271] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 4.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 39.5%
(APIServer pid=1) INFO 06-07 15:30:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 39.5%
(APIServer pid=1) INFO: 127.0.0.1:41894 - “GET / HTTP/1.1” 404 Not Found
(APIServer pid=1) INFO: 127.0.0.1:41894 - “GET /favicon.ico HTTP/1.1” 404 Not Found
^C(EngineCore pid=114) INFO 06-07 15:37:10 [core.py:1266] Shutdown initiated (timeout=0)
(EngineCore pid=114) INFO 06-07 15:37:10 [core.py:1289] Shutdown complete
(APIServer pid=1) INFO 06-07 15:37:10 [launcher.py:137] Shutting down FastAPI HTTP server.
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
Error occured as had another server live my fault.
the results post /vi/chat/completions HTTP/1.1 200 OK
This is the second part whih seem not correct
~$ curl http://localhost:8000/v1/chat/completions
-H “Content-Type: application/json”
-d ‘{
“model”: “google/gemma-3-4b-it”,
“messages”: [
{
“role”: “user”,
“content”: “Explain in one sentence why running an upstream Arm64 container on Jetson is useful.”
}
]
}’
{“id”:“chatcmpl-980611f4033a2e6f”,“object”:“chat.completion”,“created”:1780846197,“model”:“google/gemma-3-4b-it”,“choices”:[{“index”:0,“message”:{“role”:“assistant”,"paul@orpaul@opauppppaupaupppapapaupppppapaul@paulpapaulpapapppappaul@orin:~$
ANY THOUGHTS WOULD BE APPRECIATED.