Hello bright minds. I should preface this message by saying: I'm lost. This technology is new to me. Sure, some of this stuff has been around for years or longer, but in my previous job roles I started to lose my edge on tech. Now I'm back in the 'trenches of I.T.', which honestly I do love; I'm just overwhelmed by all this new stuff. So let's get to it. Be kind, please.
Our company bought 2 Sparks for our department. Both were at my boss's place while he tinkered and learned a bit. He managed to get his hands on serious hardware, so he's able to return these to work, but he mailed one to me to get my feet wet so we can eventually use them at work. The use case at work is basically sysadmin automation tasks: "hey robot, build me a VM," or "hey robot, we got an alert, take actions x, y, z," that kind of thing. But that is way off in the distance.
I am running the vLLM playbook from NVIDIA. My boss chose vLLM as the, um, I'm not even sure what it's called: the backend LLM processor? The inference "server"? (See how lost I am?) Anyway, that. He feels that vLLM is the best 'thing' to use on the Sparks, so he told me to go tinker on the one at my home and learn and figure stuff out. (Fun job, getting paid to play with this all day, but also super stressful because I'm so weak in all the areas: containers, LLMs, AI, APIs, …)
So I ran the simple vLLM playbook here:
https://build.nvidia.com/spark/vllm/instructions
I’m on step 3.
It works. But here come the questions, and then questions about a bigger model. (I'll post screenshots too.)
I'm conflicted about how much of the vLLM startup text from that link I should provide, because as you all know it's a lot of text. So I'll post the beginning part from startup and then some of the end, once it's 'ready'.
Starting up the container with vLLM and the model it talks about:
sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
==========
== vLLM ==
==========
NVIDIA Release 26.02 (build 270699521)
vLLM Version 0.15.1+befbc472
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 13.1 driver version 590.48.01 with kernel driver version 580.126.09.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for vLLM. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
/usr/local/lib/python3.12/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /usr/local/lib/python3.12/dist-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1+befbc472
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] █▄█▀ █ █ █ █ model Qwen/Qwen2.5-Math-1.5B-Instruct
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:261] non-default args: {'model_tag': 'Qwen/Qwen2.5-Math-1.5B-Instruct', 'api_server_count': 1, 'model': 'Qwen/Qwen2.5-Math-1.5B-Instruct'}
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 656/656 [00:00<00:00, 10.0MB/s]
(APIServer pid=1) INFO 03-18 12:21:08 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-18 12:21:08 [model.py:1561] Using max model len 4096
(APIServer pid=1) INFO 03-18 12:21:08 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-18 12:21:08 [vllm.py:624] Asynchronous scheduling is enabled.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 2.77MB/s]
tokenizer_config.json: 7.32kB [00:00, 55.5MB/s]
vocab.json: 2.78MB [00:00, 21.3MB/s]
merges.txt: 1.67MB [00:00, 65.9MB/s]
tokenizer.json: 7.03MB [00:00, 163MB/s]
/usr/local/lib/python3.12/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /usr/local/lib/python3.12/dist-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
(EngineCore_DP0 pid=185) INFO 03-18 12:21:12 [core.py:96] Initializing a V1 LLM engine (v0.15.1+befbc472) with config: model='Qwen/Qwen2.5-Math-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-Math-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-Math-1.5B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 
'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:33491 backend=nccl
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [gpu_model_runner.py:4033] Starting to load model Qwen/Qwen2.5-Math-1.5B-Instruct...
Now some of the output from the bottom once it finished firing up:
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 172.17.0.1:41382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 03-18 12:24:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 12:24:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
So after the server started, I ran the test command in the playbook and asked it 2+2 (need supercomputer powers for this!) in a curl request:
sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"messages": [{"role": "user", "content": "2+2"}],
"max_tokens": 500
}'
{"id":"chatcmpl-a6cdb0d5a62eb4d2","object":"chat.completion","created":1773836642,"model":"Qwen/Qwen2.5-Math-1.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"To solve the expression \\(2 + 2\\), we follow these steps:\n\n1. Identify the numbers in the expression. Here, both numbers are 2.\n2. Add the two numbers together. When we add 2 and 2, we get 4.\n\nSo, the value of the expression \\(2 + 2\\) is \\(\\boxed{4}\\).","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":31,"total_tokens":108,"completion_tokens":77,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}sparkit@bd-it-spark01:~$
Now, from the time I hit enter on the above to the time it spit out the answer was maybe a second; heck, it might say so exactly in the text somewhere. Point is, it was lightning fast. All good so far!
Question 1. Where is the model Qwen/Qwen2.5-Math-1.5B-Instruct on my computer? Is it downloading that model every time I fire up Docker and run the playbook? Or did it do that the first time and now it's cached somewhere, and if so, where? I ask because I've downloaded some big models from HF and they all download to a cache folder, which is listed when the download finishes (downloading via CLI using the hf command).
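My current (possibly wrong) understanding, which someone can hopefully confirm: the hf CLI caches to a standard location on the host, but the container has its own separate cache inside it, so without a volume mount the model gets re-downloaded into each fresh container and dies with it. A sketch of what I think the layout is (the container-side path is my assumption):

```shell
# Default host-side Hugging Face cache location (overridable via the HF_HOME env var):
HF_CACHE="${HF_HOME:-$HOME/.cache/huggingface}/hub"
echo "host cache: $HF_CACHE"

# My assumption: inside the container, downloads land in the container's own
# /root/.cache/huggingface, which vanishes when the container does. Mounting
# the host cache over it should let the container reuse what hf already pulled:
# docker run ... -v "$HOME/.cache/huggingface:/root/.cache/huggingface" ...
```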
Question 2. This model is small, yes? Why is the DGX dashboard showing 122GB of 128GB being used?! That seems wrong; I would think it should be way lower. I mean, if I can run way bigger models on my 32GB Mac using LM Studio and they don't peg out the RAM, how is this tiny 1.5B model that NVIDIA lists in its own playbook eating up basically all the RAM? A vLLM misconfiguration on my end? Misconfigured Docker settings on startup of the model? (I'm just following the link verbatim.)
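One theory I ran into while reading the vLLM docs: vLLM deliberately preallocates a fraction of GPU memory up front for weights plus KV cache (the `--gpu-memory-utilization` flag, which I believe defaults to 0.9), and on the Spark's unified memory that would show up as roughly 90% of 128GB "used" even for a tiny model. If that's right, would capping it look something like this? (Untested variant of the playbook command; the 0.5 value is just an example.)

```shell
# Hypothetical variant of the playbook command, capping vLLM's memory
# preallocation at 50% of device memory instead of the default 0.9 (90%):
docker run -it --gpus all -p 8000:8000 \
  nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
  vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --gpu-memory-utilization 0.5
```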
Putting the memory aside for a second: I know that if I wanted to actually use this sample model, I would need to tie it into a front end, perhaps something like Open WebUI, right? Would I download a Docker image of just Open WebUI? Again, I ask because I have seen and played with the playbook that has Open WebUI and Ollama bundled, and it works fine; I can download different models through the UI, load them, run them in parallel, etc.
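Is it as simple as pointing Open WebUI at vLLM's OpenAI-compatible endpoint? My untested guess from skimming the Open WebUI docs (the image name, ports, and env var here are my assumptions, not something from the playbook):

```shell
# Untested sketch: run Open WebUI in its own container and point it at the
# vLLM server's OpenAI-compatible API. On Linux, host.docker.internal needs
# the --add-host mapping so the container can reach the host.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  ghcr.io/open-webui/open-webui:main
```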
Now the bigger problem. Say I follow the same commands above but load a bigger model. (I picked models that were specifically listed as supported in the playbook.)
Same link, just the first page of it:
https://build.nvidia.com/spark/vllm/overview
Using the Hugging Face CLI I downloaded: Nemotron 3 Super 120B, Qwen3-32B, and lastly Phi-4-multimodal-instruct.
Just grabbing a few different models to test with and various sizes.
The TL;DR version of what's below: I can get the model loaded, but it takes forever (fine, big model), and then it takes forever to respond (using the same sample query as before, just changing the model name and pointing to the cached copy of the LLM). It takes forever to spit out an answer, the server log shows something like 3-7 tokens per second, and when I do get a response, sometimes it's repeated over and over, as you'll see below if you read on.
I am going to load the Phi-4 model, just because I picked it randomly. Remember, this is downloaded on the Spark already; it's located in /home/sparkit/.cache/huggingface/hub/
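One thing I suspect about my own command: I think the `-v` volume mount needs both a host path and a container path separated by a colon. With only a single path, I believe Docker just creates an empty anonymous volume at that path inside the container, so the host's cached model is never actually visible and everything gets re-downloaded. What I believe the corrected launch looks like (the container-side cache path is my assumption; the `--ipc`/`--ulimit` flags are the ones the container's own startup banner recommends):

```shell
# Hypothetical corrected launch: mount the host's HF cache into the
# container's cache path (host:container), plus the shared-memory flags
# the container banner recommends for vLLM:
docker run -it --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v /home/sparkit/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
  vllm serve "nvidia/Phi-4-reasoning-plus-FP8"
```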
Also, when I am shutting down vLLM, I just do a Ctrl+C. Not sure if this is clean or not, because it says the following when I do that:
^C(APIServer pid=1) INFO 03-18 12:57:05 [launcher.py:110] Shutting down FastAPI HTTP server.
[rank0]:[W318 12:57:05.834174784 ProcessGroupNCCL.cpp:1569] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
So, not sure there, but I don't know what else to do.
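One thing I'm going to try instead of Ctrl+C: giving the container a name at launch and stopping it from another shell, which I believe sends SIGTERM first and gives the server a chance to exit cleanly before Docker force-kills it. (The container name here is made up by me.)

```shell
# At launch, name the container so it can be addressed later, e.g.:
# docker run --name vllm-spark -it --gpus all -p 8000:8000 ... vllm serve ...

# Then, from another shell, stop it gracefully: SIGTERM first,
# SIGKILL only after the grace period (30 seconds here) expires.
docker stop --time 30 vllm-spark
```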
Give me a few (or more than a few) minutes to spin up the Phi-4 model and I'll post its behavior below.
sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 -v /home/sparkit/.cache/huggingface/hub/models nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} vllm serve "nvidia/Phi-4-reasoning-plus-FP8"
==========
== vLLM ==
==========
NVIDIA Release 26.02 (build 270699521)
vLLM Version 0.15.1+befbc472
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 13.1 driver version 590.48.01 with kernel driver version 580.126.09.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for vLLM. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
/usr/local/lib/python3.12/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /usr/local/lib/python3.12/dist-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1+befbc472
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] █▄█▀ █ █ █ █ model nvidia/Phi-4-reasoning-plus-FP8
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:261] non-default args: {'model_tag': 'nvidia/Phi-4-reasoning-plus-FP8', 'api_server_count': 1, 'model': 'nvidia/Phi-4-reasoning-plus-FP8'}
config.json: 1.74kB [00:00, 15.9MB/s]
(APIServer pid=1) INFO 03-18 13:00:06 [model.py:541] Resolved architecture: Phi3ForCausalLM
(APIServer pid=1) INFO 03-18 13:00:06 [model.py:1561] Using max model len 32768
(APIServer pid=1) INFO 03-18 13:00:06 [cache.py:216] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
.....
takes forever on these:
(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [core.py:96] Initializing a V1 LLM engine (v0.15.1+befbc472) with config: model='nvidia/Phi-4-reasoning-plus-FP8', speculative_config=None, tokenizer='nvidia/Phi-4-reasoning-plus-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Phi-4-reasoning-plus-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:34313 backend=nccl
(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [gpu_model_runner.py:4033] Starting to load model nvidia/Phi-4-reasoning-plus-FP8...
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [__init__.py:184] Selected FlashInferFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [cuda.py:364] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
model.safetensors.index.json: 55.3kB [00:00, 252MB/s]
model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.03G/1.03G [01:26<00:00, 11.9MB/s]
model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.86G/4.86G [01:28<00:00, 55.2MB/s]
(EngineCore_DP0 pid=186) tensors: 21%|████████████████████████▏ | 1.02G/4.84G [01:19<03:04, 20.7MB/s]
(EngineCore_DP0 pid=186) tensors: 99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 4.80G/4.86G [01:25<00:00, 96.0MB/s]
After they load up (I can see my RAM, BTW, ramping from 27GB after the first startup back up to close to 120GB used):
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:40<02:02, 40.71s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [01:22<01:22, 41.23s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [01:23<00:22, 22.74s/it]
still hovering around 27GB RAM at this point...
Next block that takes a while to load (below):
(EngineCore_DP0 pid=186) INFO 03-18 13:04:52 [default_loader.py:291] Loading weights took 124.36 seconds
(EngineCore_DP0 pid=186) WARNING 03-18 13:04:52 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore_DP0 pid=186) WARNING 03-18 13:04:52 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore_DP0 pid=186) INFO 03-18 13:04:53 [gpu_model_runner.py:4130] Model loading took 14.74 GiB memory and 280.527628 seconds
(EngineCore_DP0 pid=186) INFO 03-18 13:04:58 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/bfe0f27543/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=186) INFO 03-18 13:04:58 [backends.py:872] Dynamo bytecode transform time: 5.22 s
(EngineCore_DP0 pid=186) INFO 03-18 13:05:04 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
Now RAM is ramping to 50-60GB and growing; sometimes it'll drop into the 30s and then ramp back up during this time.
(EngineCore_DP0 pid=186) INFO 03-18 13:07:57 [monitor.py:34] torch.compile takes 182.09 s in total
(EngineCore_DP0 pid=186) INFO 03-18 13:07:57 [decorators.py:576] saving AOT compiled function to /root/.cache/vllm/torch_aot_compile/9052722fd1e2e1f01edc013fed5141308769076c4ce33f460cee73f2c67cb1aa/rank_0_0/model
(EngineCore_DP0 pid=186) INFO 03-18 13:07:59 [decorators.py:580] saved AOT compiled function to /root/.cache/vllm/torch_aot_compile/9052722fd1e2e1f01edc013fed5141308769076c4ce33f460cee73f2c67cb1aa/rank_0_0/model
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [gpu_worker.py:356] Available KV cache memory: 89.37 GiB
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [kv_cache_utils.py:1307] GPU KV cache size: 937,120 tokens
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 28.60x
(EngineCore_DP0 pid=186) 2026-03-18 13:08:09,549 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=186) 2026-03-18 13:08:23,415 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=186) INFO 03-18 13:08:23 [kernel_warmup.py:64] Warming up FlashInfer attention.
Now we are at 120GB; it just spikes all the way up, and the GPU utilization starts going haywire, jumping from 0 to 100 and everything in between.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:11<00:00, 4.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00, 6.12it/s]
(EngineCore_DP0 pid=186) INFO 03-18 13:09:15 [gpu_model_runner.py:5063] Graph capturing finished in 18 secs, took 1.22 GiB
(EngineCore_DP0 pid=186) INFO 03-18 13:09:15 [core.py:272] init engine (profile, create kv cache, warmup model) took 262.32 seconds
(EngineCore_DP0 pid=186) INFO 03-18 13:09:17 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 03-18 13:09:17 [api_server.py:668] Supported tasks: ['generate']
(APIServer pid=1) WARNING 03-18 13:09:17 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.8, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 03-18 13:09:17 [serving.py:177] Warming up chat template processing...
(APIServer pid=1) INFO 03-18 13:09:18 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 03-18 13:09:18 [serving.py:212] Chat template warmup completed in 1080.3ms
(APIServer pid=1) INFO 03-18 13:09:19 [api_server.py:949] Starting vLLM API server 0 on http://0.0.0.0:8000
..
Now it shows a bunch of INFO messages about routes (/v1/chat and others), but ultimately it says the server is ready.
Now I run my simple 2+2 query in another shell (again, RAM is pegged at 121GB, GPU 0%).
sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Phi-4-reasoning-plus-FP8",
"messages": [{"role": "user", "content": "2+2"}],
"max_tokens": 500
}'
It immediately logs the following on the server side; now we wait for quite some time, and memory never drops:
(APIServer pid=1) INFO 03-18 13:14:49 [loggers.py:257] Engine 000: Avg prompt throughput: 22.9 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:14:59 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:09 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:39632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 03-18 13:15:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
And of course, this time it didn't take as long as it normally does (usually minutes). Maybe it's cached my 2+2? I'll try something different in a second, but here is the client-side output:
sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Phi-4-reasoning-plus-FP8",
"messages": [{"role": "user", "content": "2+2"}],
"max_tokens": 500
}'
{"id":"chatcmpl-a49efea3d38eeb08","object":"chat.completion","created":1773839688,"model":"nvidia/Phi-4-reasoning-plus-FP8","choices":[{"index":0,"message":{"role":"assistant","content":"<think>User query: \"2+2\". I need to produce answer. Possibly a plain arithmetic addition: 2+2 = 4.\n\nWait, but careful: \"2+2\" is simple arithmetic addition. But maybe the user expects something else? Possibly \"2+2\" is a math problem. But instructions \"2+2\" might be interpreted as a request for help with addition.\n\nBut I should check instructions: \"2+2\", which is math expression.\n\nI check that the answer is \"4\". However, I must check if there's any potential hidden trick. Possibly the user is asking for \"2+2\" and my answer should be \"4\" but also maybe \"2 plus 2 equals 4\" but then I need to check if there is any content policy risk? But it's safe.\n\nBut also check if arithmetic is not a code. Actually it's safe.\n\nI'll simply produce \"2+2=4\" explanation maybe.\n\nBut check instructions: \"2+2\", simply output answer \"4\" in plain text.\n\nI check conversation instructions: \"2+2\", it's a math expression. The answer is \"4\". 
Possibly I'll produce explanation.\n\nI produce: \"2+2 equals 4.\"\n\nI'll produce answer: \"4\" and maybe \"two plus two equals four\" in plain text.\n\nI'll produce answer in plain text: \"2+2=4.\" Possibly I'll produce explanation: \"It equals 4.\" Possibly I'll produce answer explanation: \"2+2=4\" with details.\n\nI'll produce: \"2+2 is 4.\"\n\nI'll produce answer with careful explanation: \"2+2 equals 4.\" I'll produce answer as \"4.\"\n\nI'll produce answer: \"2+2=4\" with explanation if needed.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"The sum of 2 and 2 is 4.\"\n\nI'll produce answer: \"2+2=4.\"\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"4.\" and explanation: \"the addition of two and two equals four.\" I'll produce answer in plain text.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"2+2 = 4.\"\n\nI'll produce answer in plain text.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":229,"total_tokens":729,"completion_tokens":500,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}sparkit@bd-it-spark01:~$
Very annoying that this one ran fast! I swear they were taking ages! Let me try a different query; I'll ask it to print some text…
I'm not touching the server side.
Sending a new request now; let's see how long it takes...
sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Phi-4-reasoning-plus-FP8",
"messages": [{"role": "user", "content": "Display the following text on the screen: I'\''m a newbie! Help!"}],
"max_tokens": 500
}'
(Note: the apostrophe in "I'm" has to be escaped as '\'' or it breaks out of the single-quoted JSON body and the shell mangles the request.)
On the server side, we see:
(APIServer pid=1) INFO 03-18 13:15:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Waiting..... RAM is at 121GB used. I guess there are timestamps here, so I don't need to emphasize how long it's taking. GPU usage is 0%. Getting coffee, this is taking forever.
I suppose I can take this time to say that I barely even know what tokens are, and I have no idea what "GPU KV cache usage" is; I'm assuming KV means key-value pairs like in Python (I do know very basic Python). I'm not asking you to teach me everything, just help me with some fundamentals to get models running normally. This seems insanely slow; I'm scared to even try the 120B models! I'm assuming it's still doing something: in the shell where I typed the curl request I have a new prompt I can type at, yet the server is still sitting on the portion posted above, so I'm guessing something is happening, or else I'd expect it to have errored out. If I had to speculate, I'd guess it's my ignorance of Docker and of passing the right arguments to run the model correctly, but I don't know Docker (I am trying to learn), plus there are so many LLM-specific settings that it just leaves me lost. The tutorial just gets it up and running, so I guess I have to go read about vLLM on my own, and Docker with vLLM too? OK, I take back what I said about having a command prompt on the client side: it appears I do, but I can't run any commands; they don't return anything, not even ls works. So it must still be held hostage by the query. This is the worst I've seen it: nothing is happening anywhere.
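While I wait: I read that vLLM's endpoint is OpenAI-compatible, so I think I could add "stream": true to the request and watch tokens arrive as they're generated instead of staring at a silent terminal. This is an assumption on my part from the API docs; I haven't tried it on the Spark yet:

```shell
# -N disables curl's output buffering so streamed chunks print as
# they arrive. With "stream": true the server should send
# Server-Sent Events ("data: {...}" lines) instead of one big JSON
# blob at the very end.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Phi-4-reasoning-plus-FP8",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 100,
    "stream": true
  }'
```

At least that way I'd know immediately whether anything is moving.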
I spun up a 3rd shell and ran top. I see VLLM::EngineCore at the top of the list, followed by (they jump around in top) containerd, node, and code-ce99c1ed2 (guessing that's the UUID of the Docker instance?). Oddly enough, the CPU is not showing as pegged on any process; they are all almost idle!
top - 09:42:07 up 1:31, 7 users, load average: 0.24, 0.21, 0.41
Tasks: 438 total, 1 running, 437 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 95.5/124545.5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Swap: 0.0/16384.0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11086 root 20 0 148.7g 4.3g 948296 S 3.0 3.6 6:31.31 VLLM::EngineCor
4439 sparkit 20 0 21.4g 97500 27392 S 0.7 0.1 0:09.18 node
17949 sparkit 20 0 25472 5616 3416 R 0.7 0.0 0:00.49 top
10776 root 20 0 9562132 873112 196824 S 0.3 0.7 0:10.38 vllm
1 root 20 0 23460 12892 7672 S 0.0 0.0 0:03.67 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pool_workqueue_release
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-rcu_gp
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-sync_wq
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/R-kvfree_rcu_reclaim
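Since top only shows CPU, my plan for that 3rd shell is to watch the GPU and the container too. My understanding is that nvidia-smi works on the Spark, so something like this should show whether the GPU is actually doing anything (again, just my plan, not verified on this box):

```shell
# Refresh nvidia-smi every 2 seconds to watch GPU utilization and
# memory while a request is in flight. Ctrl-C to stop.
watch -n 2 nvidia-smi

# In another shell: live CPU/memory stats for the running containers.
docker stats
```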
While we are waiting, here's more info:
sparkit@bd-it-spark01:~$ docker images
i Info → U In Use
IMAGE ID DISK USAGE CONTENT SIZE EXTRA
nvcr.io/nvidia/vllm:26.02-py3 1bec659df629 21.9GB 6.77GB U
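Side note: right now the container hogs my first shell because I started it with -it. From what I've read in the Docker docs, I believe I could run it detached with a name and tail the logs separately, which would free that terminal. A sketch using the same image tag as above (the container name is just something I made up):

```shell
# Start vLLM in the background (-d) with a fixed name so it's easy
# to reference later; same image and port as the NVIDIA playbook.
docker run -d --name vllm-phi4 --gpus all -p 8000:8000 \
  nvcr.io/nvidia/vllm:26.02-py3 \
  vllm serve "nvidia/Phi-4-reasoning-plus-FP8"

# Follow the server logs from any shell; Ctrl-C detaches from the
# logs without stopping the container.
docker logs -f vllm-phi4

# Stop and remove the container when done.
docker stop vllm-phi4 && docker rm vllm-phi4
```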
Here's an image of the 2nd shell where I ran the query.
Server side, still sitting there..
I guess I'm going to kill it. Normally what I see is the tokens start out high, drop, and eventually I get an answer; sometimes the answer is repeated multiple times, basically saying the same thing in a loop. I wish I had an example.
(APIServer pid=1) INFO 03-18 13:14:49 [loggers.py:257] Engine 000: Avg prompt throughput: 22.9 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 1 reqs,
Where it says 22.9 tokens/s, that's where I usually see the numbers (except in this case, where it seems just stuck) drop down to 9 tokens/s, then 3 tokens/s, sometimes 1 token/s towards the end, before finally giving me a result.
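One sanity check I found while reading about the OpenAI-style API: there's a /v1/models endpoint, and I believe vLLM also has a bare /health endpoint. From a separate shell I can at least confirm the server is still answering HTTP while a generation is stuck (my assumption: if these hang too, the whole server is wedged, not just my request):

```shell
# Liveness check: list the models the server is serving. This should
# return a small JSON blob almost instantly even under load.
curl -s http://localhost:8000/v1/models

# Bare health check: prints the HTTP status code (200 = alive).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```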
So I am going to cancel this run, because clearly nothing is happening. No idea why. I will post in this same thread another retry with the same model and same query, but maybe with an increased max_tokens on the requesting side. I have a meeting in 10 minutes. But I guess maybe this is a good example, since now it's not even doing anything. Any and all help is most humbly appreciated.
I know there's a lot of eye-rolling going on. I get it. I'm a newbie, just trying to get to a point where the models run decently and don't eat up all the memory (I don't think that's normal, but maybe it is for this model). I wanted to work on getting Open WebUI in a container and then 'connect' it to this vLLM so I could chat with it; that's the right approach, right? I'm really trying here, 8 hrs a day plus my own time reading, etc., and always struggling. Open WebUI/Ollama was the only "success" I had, but that's not what we are using, so while it was fun to play and test out models, that's not the route our company has chosen. I know there are 8 million ways to tune this for every particular model. I was hoping it would at least 'run' so I could get it talking to Open WebUI and THEN figure out how to tune. Guess not. Another example will come.
Thank you to everyone who stuck with this very long thread and took the time to read it and respond. You are amazing, and I thank you for any bit of help. Bless you! Cheers. Hoping for some responses :) Remember, dumb it down; I feel like a child with crayons learning to color with this Spark unit!
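For the Open WebUI part, my understanding from its README is that you can point it at any OpenAI-compatible URL, which vLLM is. So when I get back to it, my plan looks something like this (the port mapping and env var names are what I remember from the Open WebUI docs, so please correct me if they're off):

```shell
# Run Open WebUI in its own container, pointed at the vLLM server on
# the host. host.docker.internal lets the UI container reach the
# host's port 8000; vLLM doesn't require an API key by default, so a
# placeholder value should be fine.
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  ghcr.io/open-webui/open-webui:main
```

Then I'd browse to http://<spark-ip>:3000 and hope the vLLM model shows up in the model picker.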