Who wants to be the hero and help a total newbie? Got a Spark and... yeah

Hello bright minds. I should preface this message by saying: I'm lost. This technology is new to me. Sure, some of this stuff has been around for years or longer, but in my previous job roles I started to lose my edge on tech. Now I'm back in the 'trenches of I.T.', which honestly I do love; I'm just overwhelmed by all this new stuff. So let's get to it. Be kind, please.

Our company bought two Sparks for our department. Both were at my boss's place while he tinkered and learned a bit. He managed to get his hands on serious hardware, so he's able to return these to work, but he mailed one to me to get my feet wet so we can eventually use them at work. The use case at work is basically sysadmin automation tasks: "hey robot, build me a VM", or "hey robot, we got an alert, take actions x, y, z", that kind of thing. But that is way off in the distance.

I am running the vLLM playbook from NVIDIA, as my boss chose vLLM as, um, I'm not sure what it's even called, the backend LLM processor? (See how lost I am?) The "server"? Anyway, that. He feels that vLLM is the best 'thing' for use on Sparks, so he told me to go tinker on the one at my home and learn and figure stuff out. (Fun job, getting paid to play with this all day, but also super stressful because I'm so weak in all the areas: containers, LLMs, AI, APIs, ...)

So I ran the simple vLLM playbook here:

https://build.nvidia.com/spark/vllm/instructions

I’m on step 3.

It works. But here come the questions... and then questions about a bigger model. (I'll post screenshots too.)

I'm conflicted about how much of the vLLM startup text from the link I should provide, because as you all know it's a lot of text. So I'll post the beginning part as it's starting up, and then some of the end when it's 'ready'.

Starting up the container with vLLM and the model the playbook calls for:

sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"

==========
== vLLM ==
==========

NVIDIA Release 26.02 (build 270699521)
vLLM Version 0.15.1+befbc472
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.1 driver version 590.48.01 with kernel driver version 580.126.09.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for vLLM.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/usr/local/lib/python3.12/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /usr/local/lib/python3.12/dist-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] 
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1+befbc472
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-Math-1.5B-Instruct
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:325] 
(APIServer pid=1) INFO 03-18 12:21:04 [utils.py:261] non-default args: {'model_tag': 'Qwen/Qwen2.5-Math-1.5B-Instruct', 'api_server_count': 1, 'model': 'Qwen/Qwen2.5-Math-1.5B-Instruct'}
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 656/656 [00:00<00:00, 10.0MB/s]
(APIServer pid=1) INFO 03-18 12:21:08 [model.py:541] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) INFO 03-18 12:21:08 [model.py:1561] Using max model len 4096
(APIServer pid=1) INFO 03-18 12:21:08 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-18 12:21:08 [vllm.py:624] Asynchronous scheduling is enabled.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:00<00:00, 2.77MB/s]
tokenizer_config.json: 7.32kB [00:00, 55.5MB/s]
vocab.json: 2.78MB [00:00, 21.3MB/s]
merges.txt: 1.67MB [00:00, 65.9MB/s]
tokenizer.json: 7.03MB [00:00, 163MB/s]
/usr/local/lib/python3.12/dist-packages/torchvision/io/image.py:14: UserWarning: Failed to load image Python extension: 'Could not load this library: /usr/local/lib/python3.12/dist-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
(EngineCore_DP0 pid=185) INFO 03-18 12:21:12 [core.py:96] Initializing a V1 LLM engine (v0.15.1+befbc472) with config: model='Qwen/Qwen2.5-Math-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-Math-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-Math-1.5B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 
'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:33491 backend=nccl
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=185) INFO 03-18 12:21:13 [gpu_model_runner.py:4033] Starting to load model Qwen/Qwen2.5-Math-1.5B-Instruct...

Now some of the output from the bottom once it finished firing up:

(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO:     172.17.0.1:41382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 03-18 12:24:06 [loggers.py:257] Engine 000: Avg prompt throughput: 3.1 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 12:24:16 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

So after the server started, I ran the test command in the playbook and asked it 2+2 (need supercomputer powers for this!) in a curl request:

sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
    "messages": [{"role": "user", "content": "2+2"}],
    "max_tokens": 500
}' 
{"id":"chatcmpl-a6cdb0d5a62eb4d2","object":"chat.completion","created":1773836642,"model":"Qwen/Qwen2.5-Math-1.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"To solve the expression \\(2 + 2\\), we follow these steps:\n\n1. Identify the numbers in the expression. Here, both numbers are 2.\n2. Add the two numbers together. When we add 2 and 2, we get 4.\n\nSo, the value of the expression \\(2 + 2\\) is \\(\\boxed{4}\\).","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":31,"total_tokens":108,"completion_tokens":77,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}sparkit@bd-it-spark01:~$ 

Now, from the time I hit enter on the above to the time it spit out the answer was maybe a second; heck, it might say so exactly in the text somewhere. Point is, it was lightning fast. All good so far!

Question 1. Where is the model Qwen/Qwen2.5-Math-1.5B-Instruct on my computer? Is it downloading that model every time I fire up Docker and run the playbook? Or did it do that the first time and now it's cached somewhere, and if so, where? I ask because I've downloaded some big models from HF and they all download to a cache folder, which is listed when the download finishes (downloading via CLI using the hf command).
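
From what I can tell from the hf CLI output on my other downloads, everything lands in the default Hugging Face cache. Here's the quick check I've been running to see what's actually on disk (my understanding from the HF docs is that HF_HOME overrides the base directory, and each repo shows up as a models--org--name folder):

```shell
# Default Hugging Face cache location; HF_HOME overrides the base directory
CACHE="${HF_HOME:-$HOME/.cache/huggingface}/hub"
echo "HF cache dir: $CACHE"
# cached repos show up as models--<org>--<name> directories in there
ls "$CACHE" 2>/dev/null | grep '^models--' || echo "(nothing cached here yet)"
du -sh "$CACHE" 2>/dev/null || true
```

One thing I suspect (correct me if wrong): the container has its own filesystem, so anything the container downloads goes inside the container and is gone when it's removed, unless that cache dir is mounted in.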

Question 2. This model is small, yes? Why is the DGX dashboard showing 122GB of 128GB being used?! That seems wrong; I would think it should be way lower. If I can run way bigger models on my 32GB Mac using LM Studio without pegging the RAM, how is this tiny 1.5B model, which NVIDIA lists in its own playbook, eating up basically all the RAM? A vLLM misconfiguration on my end? Misconfigured Docker settings when starting up the model? (I'm just following the link verbatim.)
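
For what it's worth, skimming the vLLM docs, it sounds like vLLM deliberately preallocates most of the GPU memory up front for its KV cache by default, which might explain the dashboard. This is just my guess at the relevant knob; the flag is from the vLLM docs, but the 0.30 value is something I made up to test with:

```shell
# Same playbook command, but capping how much memory vLLM grabs for KV cache
# (--gpu-memory-utilization defaults to ~0.9 per the vLLM docs, I believe)
docker run -it --gpus all --ipc=host -p 8000:8000 \
  nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
  vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct" \
  --gpu-memory-utilization 0.30
```

If that's right, then the 122GB isn't the model's size at all, just vLLM reserving space for request caching, but I'd love confirmation.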

Putting the memory aside for a second: I know that if I wanted to actually use this sample model, I would need to tie it into a front end, perhaps something like Open WebUI, right? Would I download a Docker image of just Open WebUI? Again, I ask because I have seen and played with the playbook that bundles Open WebUI and Ollama, and it works fine; I can download different models through the UI, load them, run them in parallel, etc.
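
My guess at the plumbing, based on my reading of the Open WebUI docs (image name, env var, and ports are all my assumptions, so please correct me): run Open WebUI as its own container and point it at vLLM's OpenAI-compatible endpoint:

```shell
# <spark-ip> is a placeholder for the Spark's address on my network
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://<spark-ip>:8000/v1 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then (I think) I'd browse to http://<spark-ip>:3000 and it would list whatever model vLLM is serving. Is that the right mental model?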

Now the bigger problem. Say I follow the same commands above but load a bigger model. (I chose to download models that were specifically listed as supported in the playbook.)

Same link, just its first page:

https://build.nvidia.com/spark/vllm/overview

Using the Hugging Face CLI, I downloaded: Nemotron 3 Super 120B, Qwen3-32B, and lastly Phi-4-multimodal-instruct.

Just grabbing a few different models of various sizes to test with.

The TL;DR, which comes below: I can get the model loaded, but it takes forever to load (fine, big model) and it takes forever to respond (using the same sample query as before, just changing the model name and pointing to the cached copy of the LLM). The server log shows something like 3-7 tokens per second, and when I do get a response, sometimes it's repeated over and over, as you'll see below if you read on.
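
One thing I want to try so it at least doesn't *look* hung: the same curl request with streaming turned on, so tokens show up as they're generated. (The "stream" field is part of the OpenAI-style API as far as I can tell, and -N stops curl from buffering; both are my reading of the docs, not the playbook.)

```shell
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Phi-4-reasoning-plus-FP8",
    "messages": [{"role": "user", "content": "2+2"}],
    "max_tokens": 500,
    "stream": true
}'
```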

I am going to load the Phi-4 model, just because I picked it randomly. Remember, this is already downloaded on the Spark; it's located in /home/sparkit/.cache/huggingface/hub/
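
Side question while I'm here: re-reading the docker run docs, I think a -v flag needs host-path:container-path to be a bind mount, so my guess at making the container see that cache (the container-side path is my assumption, based on the container running as root) would be:

```shell
# Bind-mount the host HF cache into the container so it reuses downloads
docker run -it --gpus all --ipc=host -p 8000:8000 \
  -v /home/sparkit/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
  vllm serve "nvidia/Phi-4-reasoning-plus-FP8"
```

If I got the -v syntax wrong below, that might explain why it seems to re-download things. Someone please check me on that.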

Also, when I am shutting down vLLM I just do a Ctrl+C. I'm not sure if this is clean or not, because it says the following when I do that:

^C(APIServer pid=1) INFO 03-18 12:57:05 [launcher.py:110] Shutting down FastAPI HTTP server.
[rank0]:[W318 12:57:05.834174784 ProcessGroupNCCL.cpp:1569] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.

So I'm not sure there, but I don't know what else to do.
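
One idea from the docker docs that I might try instead of Ctrl+C: give the container a name and stop it from another shell (docker stop sends SIGTERM first, then SIGKILL after a grace period, if I'm reading right), with --rm so it cleans itself up:

```shell
# Start as before, but named and self-cleaning
docker run -it --rm --name vllm-server --gpus all -p 8000:8000 \
  nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
  vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"

# Then, from another shell:
docker stop vllm-server
```

No clue if that avoids the destroy_process_group() warning, though.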

Give me a few minutes (or more than a few) to spin up the Phi-4 model, and I'll post its behavior below.

sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 -v /home/sparkit/.cache/huggingface/hub/models nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} vllm serve "nvidia/Phi-4-reasoning-plus-FP8"

==========
== vLLM ==
==========

NVIDIA Release 26.02 (build 270699521)
vLLM Version 0.15.1+befbc472
(same license terms, CUDA/SHMEM notes, and torchvision warning as in the first startup above)
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] 
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.15.1+befbc472
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]   █▄█▀ █     █     █     █  model   nvidia/Phi-4-reasoning-plus-FP8
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:325] 
(APIServer pid=1) INFO 03-18 13:00:02 [utils.py:261] non-default args: {'model_tag': 'nvidia/Phi-4-reasoning-plus-FP8', 'api_server_count': 1, 'model': 'nvidia/Phi-4-reasoning-plus-FP8'}
config.json: 1.74kB [00:00, 15.9MB/s]
(APIServer pid=1) INFO 03-18 13:00:06 [model.py:541] Resolved architecture: Phi3ForCausalLM
(APIServer pid=1) INFO 03-18 13:00:06 [model.py:1561] Using max model len 32768
(APIServer pid=1) INFO 03-18 13:00:06 [cache.py:216] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.


.....
It takes forever on these:


(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [core.py:96] Initializing a V1 LLM engine (v0.15.1+befbc472) with config: model='nvidia/Phi-4-reasoning-plus-FP8', speculative_config=None, tokenizer='nvidia/Phi-4-reasoning-plus-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Phi-4-reasoning-plus-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:34313 backend=nccl
(EngineCore_DP0 pid=186) INFO 03-18 13:00:11 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [gpu_model_runner.py:4033] Starting to load model nvidia/Phi-4-reasoning-plus-FP8...
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [__init__.py:184] Selected FlashInferFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
(EngineCore_DP0 pid=186) INFO 03-18 13:00:12 [cuda.py:364] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
model.safetensors.index.json: 55.3kB [00:00, 252MB/s]
model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.03G/1.03G [01:26<00:00, 11.9MB/s]
model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.86G/4.86G [01:28<00:00, 55.2MB/s]
(EngineCore_DP0 pid=186) tensors:  21%|████████████████████████▏                                                                                          | 1.02G/4.84G [01:19<03:04, 20.7MB/s]
(EngineCore_DP0 pid=186) tensors:  99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 4.80G/4.86G [01:25<00:00, 96.0MB/s]


After they load up (by the way, I can watch my RAM ramp from 27GB after the first startup back up to close to 120GB used):

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:40<02:02, 40.71s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [01:22<01:22, 41.23s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:23<00:22, 22.74s/it]


Still hovering around 27GB of RAM at this point...

Next block, which takes a while to load (below):

(EngineCore_DP0 pid=186) INFO 03-18 13:04:52 [default_loader.py:291] Loading weights took 124.36 seconds
(EngineCore_DP0 pid=186) WARNING 03-18 13:04:52 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore_DP0 pid=186) WARNING 03-18 13:04:52 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore_DP0 pid=186) INFO 03-18 13:04:53 [gpu_model_runner.py:4130] Model loading took 14.74 GiB memory and 280.527628 seconds
(EngineCore_DP0 pid=186) INFO 03-18 13:04:58 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/bfe0f27543/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=186) INFO 03-18 13:04:58 [backends.py:872] Dynamo bytecode transform time: 5.22 s
(EngineCore_DP0 pid=186) INFO 03-18 13:05:04 [backends.py:302] Cache the graph of compile range (1, 2048) for later use


Now RAM is ramping to 50-60GB and growing; sometimes it'll drop into the 30s and then ramp back up during this time.


(EngineCore_DP0 pid=186) INFO 03-18 13:07:57 [monitor.py:34] torch.compile takes 182.09 s in total
(EngineCore_DP0 pid=186) INFO 03-18 13:07:57 [decorators.py:576] saving AOT compiled function to /root/.cache/vllm/torch_aot_compile/9052722fd1e2e1f01edc013fed5141308769076c4ce33f460cee73f2c67cb1aa/rank_0_0/model
(EngineCore_DP0 pid=186) INFO 03-18 13:07:59 [decorators.py:580] saved AOT compiled function to /root/.cache/vllm/torch_aot_compile/9052722fd1e2e1f01edc013fed5141308769076c4ce33f460cee73f2c67cb1aa/rank_0_0/model
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [gpu_worker.py:356] Available KV cache memory: 89.37 GiB
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [kv_cache_utils.py:1307] GPU KV cache size: 937,120 tokens
(EngineCore_DP0 pid=186) INFO 03-18 13:08:01 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 28.60x
(EngineCore_DP0 pid=186) 2026-03-18 13:08:09,549 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=186) 2026-03-18 13:08:23,415 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=186) INFO 03-18 13:08:23 [kernel_warmup.py:64] Warming up FlashInfer attention.


Now we're at 120GB; it just spikes all the way up, and GPU utilization starts going haywire, jumping from 0 to 100 and everything in between.

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:11<00:00,  4.37it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00,  6.12it/s]
(EngineCore_DP0 pid=186) INFO 03-18 13:09:15 [gpu_model_runner.py:5063] Graph capturing finished in 18 secs, took 1.22 GiB
(EngineCore_DP0 pid=186) INFO 03-18 13:09:15 [core.py:272] init engine (profile, create kv cache, warmup model) took 262.32 seconds
(EngineCore_DP0 pid=186) INFO 03-18 13:09:17 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 03-18 13:09:17 [api_server.py:668] Supported tasks: ['generate']
(APIServer pid=1) WARNING 03-18 13:09:17 [model.py:1371] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.8, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 03-18 13:09:17 [serving.py:177] Warming up chat template processing...
(APIServer pid=1) INFO 03-18 13:09:18 [hf.py:310] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 03-18 13:09:18 [serving.py:212] Chat template warmup completed in 1080.3ms
(APIServer pid=1) INFO 03-18 13:09:19 [api_server.py:949] Starting vLLM API server 0 on http://0.0.0.0:8000

..

Now it shows a bunch of INFO messages about routes (/v1/chat and others), but ultimately it says the server is ready.
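
At this point, I've been checking it's actually alive with a couple of quick curls. (These endpoints seem to be part of what vLLM's OpenAI-style server exposes, as far as I can tell from its docs; again, my reading, not the playbook.)

```shell
# Liveness check; should return HTTP 200 once the engine is up
curl -i http://localhost:8000/health
# List the served model name(s) the API expects in requests
curl http://localhost:8000/v1/models
```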






Now I run my simple 2+2 query in another shell (again, RAM is pegged at 121GB, GPU at 0%).

sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Phi-4-reasoning-plus-FP8",
    "messages": [{"role": "user", "content": "2+2"}],
    "max_tokens": 500
}'

It immediately logs the following on the server side. Now we wait for quite some time; memory never drops:

(APIServer pid=1) INFO 03-18 13:14:49 [loggers.py:257] Engine 000: Avg prompt throughput: 22.9 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

(APIServer pid=1) INFO 03-18 13:14:59 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:09 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:39632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 03-18 13:15:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 03-18 13:15:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


And of course, this time it didn't take as long as it normally does (usually minutes). Maybe it has cached my 2+2? I'll try something different in a second, but here is the client-side output:

sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Phi-4-reasoning-plus-FP8",
    "messages": [{"role": "user", "content": "2+2"}],
    "max_tokens": 500
}'
{"id":"chatcmpl-a49efea3d38eeb08","object":"chat.completion","created":1773839688,"model":"nvidia/Phi-4-reasoning-plus-FP8","choices":[{"index":0,"message":{"role":"assistant","content":"<think>User query: \"2+2\". I need to produce answer. Possibly a plain arithmetic addition: 2+2 = 4.\n\nWait, but careful: \"2+2\" is simple arithmetic addition. But maybe the user expects something else? Possibly \"2+2\" is a math problem. But instructions \"2+2\" might be interpreted as a request for help with addition.\n\nBut I should check instructions: \"2+2\", which is math expression.\n\nI check that the answer is \"4\". However, I must check if there's any potential hidden trick. Possibly the user is asking for \"2+2\" and my answer should be \"4\" but also maybe \"2 plus 2 equals 4\" but then I need to check if there is any content policy risk? But it's safe.\n\nBut also check if arithmetic is not a code. Actually it's safe.\n\nI'll simply produce \"2+2=4\" explanation maybe.\n\nBut check instructions: \"2+2\", simply output answer \"4\" in plain text.\n\nI check conversation instructions: \"2+2\", it's a math expression. The answer is \"4\". 
Possibly I'll produce explanation.\n\nI produce: \"2+2 equals 4.\"\n\nI'll produce answer: \"4\" and maybe \"two plus two equals four\" in plain text.\n\nI'll produce answer in plain text: \"2+2=4.\" Possibly I'll produce explanation: \"It equals 4.\" Possibly I'll produce answer explanation: \"2+2=4\" with details.\n\nI'll produce: \"2+2 is 4.\"\n\nI'll produce answer with careful explanation: \"2+2 equals 4.\" I'll produce answer as \"4.\"\n\nI'll produce answer: \"2+2=4\" with explanation if needed.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"The sum of 2 and 2 is 4.\"\n\nI'll produce answer: \"2+2=4.\"\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"4.\" and explanation: \"the addition of two and two equals four.\" I'll produce answer in plain text.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"2+2 = 4.\"\n\nI'll produce answer in plain text.\n\nI'll produce answer: \"4.\"\n\nI'll produce answer: \"","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":229,"total_tokens":729,"completion_tokens":500,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}sparkit@bd-it-spark01:~$ 

Very annoying that this one ran fast! I swear they were taking ages before! Let me try a different query; I'll ask it to print some text…

I’m not touching the server side..

Sending a new request now, let's see how long it takes...

sparkit@bd-it-spark01:~$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Phi-4-reasoning-plus-FP8",
    "messages": [{"role": "user", "content": "Display the following text on the screen: I'\''m a newbie! Help!"}],
    "max_tokens": 500
}'
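(Quick aside for anyone copy-pasting: an apostrophe like the one in "I'm" ends a single-quoted shell string early and mangles the JSON you send to the API. The usual escape is the `'\''` sequence. A tiny sketch with a made-up payload:)

```shell
# Inside a single-quoted shell string, write an apostrophe as '\''
# (close the quote, add an escaped quote, reopen the quote).
payload='{"content": "I'\''m a newbie! Help!"}'
echo "$payload"
```

Echoing the variable lets you eyeball the intact JSON before sending it with curl.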

On the server side, we see:

(APIServer pid=1) INFO 03-18 13:15:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

(APIServer pid=1) INFO 03-18 13:15:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%


Waiting..... RAM at 121 GB used. I guess there are timestamps here, so I don't need to emphasize how long it's taking. GPU usage 0%. Getting coffee, this is taking forever.

I suppose I can take this time to say that I barely even know what tokens are, and I have no idea what GPU KV cache usage is. I'm assuming KV means key-value, like a dict in Python (I do know very basic Python). I'm not asking you to teach me everything, just help me with some fundamentals to get models running normally. This seems insanely slow; I'm scared to even try the 120B models!

I'm assuming it's still doing something. In the shell where I typed the curl request, I have a new prompt I can type into, yet the server is still on the portion posted above, so I'm guessing something is happening, or else I'd think it would have errored out. If I had to speculate, I'd guess it's my ignorance of Docker and of passing the right arguments to run the model correctly, but I don't know Docker (I'm trying to learn), plus there are so many LLM-specific things you can set that it just leaves me lost. The tutorial just gets it up and running, so I guess I have to go read about vLLM on my own, and Docker with vLLM too?

OK, I take that back about having a command prompt on the client side. It appears I do, but I can't run any commands; they don't return anything, not even ls works. So it must still be held hostage by the query. This is the worst I have seen it: nothing is happening anywhere. I spun up a third shell and ran top. I see VLLM::EngineCore at the top of the list, followed by (they jump around in top) containerd, node, and code-ce99c1ed2 (guessing that's the UUID of the Docker instance?). Oddly enough, the CPU is not showing as pegged on any process; they are all almost idle!



 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                                                                         
  11086 root      20   0  148.7g   4.3g 948296 S   3.0   3.6   6:31.31 VLLM::EngineCor                                                                                                                                                                                                                 
   4439 sparkit   20   0   21.4g  97500  27392 S   0.7   0.1   0:09.18 node                                                                                                                                                                                                                            
  17949 sparkit   20   0   25472   5616   3416 R   0.7   0.0   0:00.49 top                                                                                                                                                                                                                             
  10776 root      20   0 9562132 873112 196824 S   0.3   0.7   0:10.38 vllm                                                                                                                                                                                                                            
      1 root      20   0   23460  12892   7672 S   0.0   0.0   0:03.67 systemd                                                                                                                                                                                                                         
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.04 kthreadd                                                                                                                                                                                                                        
      3 root      20   0       0      0      0 S   0.0   0.0   0:00.00 pool_workqueue_release                                                                                                                                                                                                          
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-rcu_gp                                                                                                                                                                                                                
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-sync_wq                                                                                                                                                                                                               
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/R-kvfree_rcu_reclaim 


top - 09:42:07 up  1:31,  7 users,  load average: 0.24, 0.21, 0.41
Tasks: 438 total,   1 running, 437 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem : 95.5/124545.5 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||     ] 
MiB Swap:  0.0/16384.0  [                                                                                                    ] 
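In case it helps anyone else flailing like me, here's roughly what I poke at from yet another shell to tell whether the server is alive or truly wedged (standard Docker commands plus the OpenAI-style endpoint from the playbook; substitute your own container ID from docker ps):

```shell
# Is the container still up at all?
docker ps

# Recent server-side log lines (replace <container-id> with yours)
docker logs --tail 20 <container-id>

# If this returns the model list, the API server itself is alive,
# even if one request is stuck
curl -s http://localhost:8000/v1/models
```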

While we are waiting, here's more info...

sparkit@bd-it-spark01:~$ docker images
                                                                                                                                                                                                                                                                                    i Info →   U  In Use
IMAGE                           ID             DISK USAGE   CONTENT SIZE   EXTRA
nvcr.io/nvidia/vllm:26.02-py3   1bec659df629       21.9GB         6.77GB    U   

Here's an image of the 2nd shell where I ran the query.

Server side, still sitting there...

I guess I'm going to kill it. Normally what I see is:

the tokens start out high, drop, and eventually I get an answer. Sometimes the answer is repeated multiple times, basically saying the same thing like a loop. I wish I had an example.

(APIServer pid=1) INFO 03-18 13:14:49 [loggers.py:257] Engine 000: Avg prompt throughput: 22.9 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 1 reqs, 

Where it says 22.9 tokens/s is where I usually see the numbers (except in this case, where it seems just stuck) drop down to 9 tokens, then 3 tokens, sometimes 1 token per second towards the end, before finally giving me a result.

So I am going to cancel this run, because clearly nothing is happening. NO idea why. I will post another retry on this same thread, same model, same query, but maybe with max_tokens increased on the requesting side. I have a meeting in 10 minutes, but I guess maybe this is a good example, since now it's not even doing anything.

Any and all help is most humbly appreciated. I know there's a lot of eyerolling going on. I get it. I'm a newbie, just trying to get to a point where the models run decently and don't eat up all the memory (I don't think that's normal, but maybe it is for this model). I wanted to get Open WebUI running in a container and then 'connect' it to this vLLM so I could chat with it; that's the right approach, right? I'm so sorry guys, I'm really trying here, 8 hours a day plus my own time reading, etc., and always struggling. Open WebUI/Ollama was the only "success" I had, but that's not what we are using, so while it was fun to play with and test out models, that's not the route our company has chosen.

I know there are 8 million ways to tune this for every particular model. I was hoping it would at least 'run' so I could get it talking to Open WebUI and THEN figure out how to tune. Guess not. Another example will come. Thank you to everyone who stuck with this very long thread and took the time to read it and respond; you are amazing, and I thank you for any bit of help. Bless you! Cheers. Hoping for some responses :) Remember, dumb it down; I feel like a child with crayons learning to color with this Spark unit!

The Nvidia playbooks are just there to test a bunch of things to see how different tech stacks and ideas can be implemented. Once you’re done, wipe your hard drive and start from scratch.

Look at the following projects to get started with vLLM:

But of course, you have other options if you prefer llama.cpp, for example.

The one thing you're going to have to think about in your context is how the boxes are going to be used: single or multi-user? For a single admin setting things up for the office, you can follow most of the instructions here, and a lot of caching will be done in your user folder. But in a multi-user environment, you will have a lot of extra work to make sure all users can access shared resources, or you'll fill up the SSD very quickly.

  • Question 1: In a Hugging Face cache in your user folder. It will only be re-downloaded if there is a newer snapshot.
  • Question 2: The KV cache takes as much memory as it can, unless you cap vLLM at the "GPU level" or cap the KV cache itself.
  • Open WebUI: Yes, build a Docker Compose file and connect it to your running vLLM instance. No need for Ollama if you install something like Qwen3.5-122B. OWUI comes with lots of goodies to get started, and that model can do everything except generate images. You can also get a Langflow Docker container and build something similar to OWUI (simpler and more useful, really) quite quickly.
  • Some of the bugs you saw happen from time to time. Restart vLLM, clear the cache, etc.
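For the Open WebUI piece, a minimal single-container sketch might look like this (the image tag and ports are assumptions; the key part is pointing OPENAI_API_BASE_URL at whatever host and port your vLLM instance is serving on):

```shell
# Run Open WebUI and point it at vLLM's OpenAI-compatible endpoint
docker run -d --name open-webui -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://<spark-ip>:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
# Then browse to http://<spark-ip>:3000 and the vLLM model should appear
```

vLLM doesn't require an API key by default, so any placeholder value works there.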

Thank you for sharing the knowledge and being kind to a beginner. I did stumble across someone posting that they had managed to get vLLM usage on NVIDIA NVFP4 models down to manageable sizes; that's probably what you are referring to. I'm going to look into that more. I just hope it's 'newbie' friendly. I found a GitHub page from the guy, and it looks like exactly what I want, but NO IDEA how to even start :(. More reading and tinkering and asking. My use case is to have the Spark run Agent Zero as a "system admin," performing mundane tasks, monitoring, etc., once tuned and configured of course. Right now I've been using the LM Studio CLI on the Spark, as that seems to give me the most memory relief and speed, though it's a struggle. I haven't found a good main chat model that doesn't start to assume things, etc. I have so much to learn it's insanity. You all are frigging amazing for knowing this stuff!

Hi @Anglerfish, it’s definitely worth persevering with the Spark though I know it can be daunting in the beginning!

@sggin1 wrote a great post on where all the RAM goes when using vLLM which is worth reading:

The main point is that vLLM by default configures itself to assume you’re going to be making a lot of requests in parallel, as though it’s serving a workgroup or running a benchmark. This is fine if that’s what you’re doing but can be a bit confusing at first and is not necessary for simple single user conversation.

Ollama is actually a little simpler to use. Arguably less powerful, but I found it easier when I was starting to use the Spark. Unfortunately, its models are in a completely different format, and cached in a different directory, so you end up having the same model downloaded twice into two different areas!

Let me know if you have any specific issues and I may be able to help. Good luck!!!


P.S. In terms of general chat models that don’t do or assume weird things, I’ve found gpt-oss:120b to be excellent. It’s not the most modern or whatever but it’s a fantastic ‘training wheels’ LLM offering brilliant performance both speed wise and in terms of high quality output. It’s a little older so that reflects in its knowledge cutoff date. It doesn’t exhibit the delays I see with other models sometimes.


Thank you @j0n for the kind words and guidance. Yeah, I found some other posts from newbies like me also asking about vLLM and its RAM hogging. I get it now: it's by design. I've been watching YT videos to learn more about vLLM and inference engines in general, and quantization, etc. I've been using @eugr's implementation (or rather serious tweaking) of vLLM with some of the models he has in there. I was able to get my memory usage lowered to a more manageable size. Yesterday, during testing with his vLLM version, I was running the Qwen 3.5 30B model he includes in his playbook (A3B, FP8).

I tuned it down just a hair to 0.65 on the GPU, because after a long set of questions (probing Agent Zero for flaws or edge cases I need to train for) it was pushing 123 GB, and that was a little too much for my comfort. With 0.65 it dropped dramatically, to 116 GB max. That I'm OK with. The reality is, even if I'm "wasting" RAM, this Spark is going to do nothing else but be a backend for Agent Zero. It was spitting out answers super fast and very detailed. I was keeping an eye on the vLLM instance: it was doing about 40-50 tokens/sec, and the KV cache hit rate steadily climbed; towards the end of my testing it was at around 70%, which according to my work's "Co-Pilot Enterprise" AI was pretty darn good. So yeah, I'm using AI (work's CP) to help me with AI, lol. But yesterday I followed the playbook from NVIDIA and quantized my first LLM, aww! lol.
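For anyone wondering where that 0.65 goes: in a stock vLLM setup it's the --gpu-memory-utilization flag on vllm serve. A rough sketch (the model name and the other limits are just illustrative values, not exactly what the playbook script does):

```shell
# --gpu-memory-utilization caps how much of the memory pool vLLM grabs up front
# --max-model-len shrinks the context window, and with it the KV cache
# --max-num-seqs limits how many parallel requests vLLM pre-allocates for
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --gpu-memory-utilization 0.65 \
  --max-model-len 32768 \
  --max-num-seqs 4
```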

I'm also using LM Studio to host the utility model, which is a Mistral 14B model, along with a small text embedding model running in LocalAI (all Docker containers). I know it's a hodgepodge of engines spread out. As I learn more, hopefully I can get a cleaner setup.

Right now it's working nicely, so I can actually start working to train and gate the Agent Zero product; even if I end up switching the backend, at least the training will have been done.

I'm lucky in that I learned MicroPython because of my interest in microcontrollers, which led me to learning Python 3 (I would say I'm still a beginner, but I can read and understand code, even if I can't write crazy advanced stuff by myself). That helps with troubleshooting and with learning how the Spark works, since it's heavy on Python.

Just this morning I faced a stupid issue that I couldn't solve. I had vLLM running with all the other models I spoke about, etc. I saw there were updates for the Spark, so I wanted to stop everything cleanly and then patch. Well, for reasons unknown, I was unable to stop the Docker container for vLLM; it gave me some error about the daemon. Normally I leave the terminal window open on my work machine so I can just Ctrl+C it, but I had powered down my work machine, so I no longer had access to that live running vLLM instance to Ctrl+C. I was going to PM @eugr to ask how to do it. I feel stupid. I AM stupid at all this. I run your vLLM using the script, which starts it nicely, but if I close that terminal, yeah, I'm on the struggle bus as to how to cleanly shut it down. I would surely appreciate help there. I can't always have the terminal open; I run the Spark headless too.

I also noticed that if I tried to use a different model than what was included in the vLLM GitHub package, it was horribly slow. I don't know why, and I don't know how to fix it or even troubleshoot it. For example, I downloaded Nemotron-Nano-9B and tried to run it in your version of vLLM, and it was so slow to return a simple curl. Being so new to all of this, I don't even know if there's a way to force that model (or any new model, really) to "work" correctly with your vLLM. Which I should say: THANK YOU for your contributions. I've seen your posts working with NVIDIA employees to help address issues. You are an ace in my book, a helpful gentleman helping an emerging community and ecosystem. So I tip my hat to you, sir!

I was reading a little about NVIDIA NIM; I don't know if that might also be a viable option. I know enough to say that yes, I could spin up Ollama and have this working pretty quickly with little tweaking, but my gut, and what I can tell from the NVIDIA docs, says that's not the inference engine you really want. I'm trying to figure out the "best" engine for my use case (1-6 users max, and likely never 6 at a time, maybe 2 or 3 tops) using Agent Zero for sysadmin tasks. I need Agent Zero to respond quickly, at least for simple questions, and it's OK if it has to think longer for harder things. But during my endless testing, sometimes I was getting responses to a simple "Say hello." taking 2 minutes! Then asking longer questions was horrible, up to 5 minutes, and it lagged as it spit out the answer (which was OK-ish), but no, my work won't settle for that. Then I found this vLLM, and with the Qwen that came in the config (I know, I downloaded the model, or rather the script did), now using A0, I say "Say hello" and it replies within 2 seconds tops. Long, deep questions may take 30 seconds or so, but it starts spitting out the huge long answer immediately, so it's a good user experience; you see answers coming quickly even if the full answer takes a while, if that makes sense.

I'm still open to suggestions. I'm not locked into anything; my boss basically gave me a "blank check" in terms of how to configure the Spark. So right now, it's working, it's eating a total of 116 GB, and it's working fast in A0, so I'm going with it for now. I did do a comparison with Nemotron-Nano-30B-A3B (NVFP4), and the responses were fast, but man, there was no "depth" to the answers. The Qwen model smoked it.

So for 2 solid weeks of work hours now, I've been doing nothing but reading, testing models and engines, and, well, learning everything. I know basic Linux and basic Python; the rest I'm learning as I go. I didn't know Docker before this. I didn't even know how AI worked. I use the Co-Pilot license we have at work, and it works pretty darn well for answering quick troubleshooting questions here and there, but that was my whole exposure to AI. Nothing. End user! My boss encouraged me to load LM Studio and another product called Inferencer (macOS only), and I started playing with that (this was pre-Spark), and wow. An entire new world of IT and computers was right in front of me. Luckily I have a decent MacBook Pro Ultra with unified memory (only 32 GB, but still), so I was able to tinker with models. That opened my eyes. I was like, "oh dang, I'm so behind the curve."

I'm forging ahead. I will post more questions. Be kind to someone who still has no real idea what he's doing. If anyone can help me figure out how to cleanly shut down that instance, that would be cool. I just ended up letting the Spark update itself and reboot, in hopes it shut down cleanly, but I don't know :( I kid you not, I dream about AI and work; that's how much I'm racking my brain every day on this stuff.

OK, this post is too long and ranting. Sorry. Updates are complete. Going to fire up the engines and start working on configuring Agent Zero. Any and all suggestions about anything and everything are always welcome! I have a hot mess of downloaded models everywhere, but I'm still plenty good on space (4 TB SSD). Cheers, mates! To all of you who know this stuff like the back of your hand: how did you do it!? It feels like every little part of AI has a whole college degree's worth of info. I'm getting old, I guess. But I gotta catch up and stay relevant! I have an engineering degree, so I am able to grasp complex things; it's just hard because I don't even know where to start or which areas to invest more of my time in. I'm "drinking from the firehose," so to speak, if you've ever heard that saying.

Thank you guys (and gals)! I’m humbled by all of you.

You can run tmux and start vLLM in a tmux session. In this case you can still Ctrl+C out of it, but you can disconnect SSH anytime you want. Another option is to use -d to start it in the background and use ./launch-cluster.sh stop to stop it. Alternatively, you can use sparkrun (https://sparkrun.dev); it's more of a one-click launcher that can be used from any machine on the network, not just on the Sparks directly.
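A rough sketch of both options (the session name is just an example, and launch-cluster.sh is the script from the recipe):

```shell
# Option A: run the launcher inside tmux so it survives SSH disconnects
tmux new -s vllm            # create a session named "vllm" and attach to it
./launch-cluster.sh         # start vLLM inside it; detach with Ctrl+B then D
tmux attach -t vllm         # reattach later to watch logs or Ctrl+C cleanly

# Option B: start detached, stop by command (per the recipe's -d and stop)
./launch-cluster.sh -d      # run in the background, no terminal needed
./launch-cluster.sh stop    # clean shutdown from any later session
```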

As for the other models, whenever you choose a model outside of the recipes, make sure you:

  1. Pick the right quant. Don't run BF16 models on Spark; they will be slow unless they are very tiny. Normally you would want FP8 for small-to-medium models and AWQ/Int4-AutoRound/NVFP4 for larger ones. Choose MoE models over dense.
  2. Go to https://spark-arena.com/ and browse the benchmarks. You can download model recipes from there.

Dude, THANK YOU. You seriously are awesome, man. I'd buy ya a beer if you were local to me! :) I must have done something wrong, because I tried using your vLLM with a Nemotron 9B NVFP4 model and it just crawled miserably. I basically took the Docker runtime config you have in the folder for vLLM and replaced the LLM with the Nemo 9B. I don't know if that's the right approach or not. You can PM me if you want (I most certainly would love to have a line of comms directly with you, if you don't mind dealing with a fool who is a newb!). I was checking out those websites; I'm going to test them out and install the software too. I could ask you 1,000,000 things, but I won't; I respect your time, and you're not my personal Spark knowledge base. But if you would PM me so we could chat from time to time, for when I feel I'm doing something either wrong or just crappy and inefficient (basically when my gut tells me, ehhh, this seems wrong, or too slow when it shouldn't be), I would love to be able to ping you. But I totally respect your time and privacy, so no hard feelings if you don't want to. Either way, thank you again for the help on a very dumb question. Cheers, brother!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.