I run models with vLLM on Docker at high utilization. I suggest you switch from ollama to llama.cpp; it's much more performant.
Sample with vLLM: Running nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD on your spark
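For reference, a minimal sketch of that kind of vLLM launch on Docker (the image tag and flags here are assumptions; the linked sample has the exact Spark-ready invocation):
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD
That serves an OpenAI-compatible API on port 8000.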
To build llama.cpp from source, first install the development tools:
sudo apt install clang cmake libcurl4-openssl-dev
Check out llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Build:
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 20
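Optionally, sanity-check the result by printing the build info (assuming the build succeeded, the binary lands in build/bin):
build/bin/llama-server --version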
Serve gpt-oss-120b:
build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF \
-fa on -ngl 999 \
--jinja \
--ctx-size 0 \
-b 2048 -ub 2048 \
--no-mmap \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--reasoning-format auto \
--chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
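Once it's running, llama-server exposes an OpenAI-compatible API (port 8080 by default), so a quick smoke test looks like:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Say hi in one sentence."}]}'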
Expected performance numbers with 1 or 2 sparks: