Models not using Spark GPU?

I can run models using vLLM in Docker with high GPU utilization. I'd also suggest switching from Ollama to llama.cpp; it's much more performant.

Sample with vLLM: Running nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD on your Spark
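
If you'd rather launch vLLM by hand, here's a minimal sketch of the Docker invocation, assuming the stock vllm/vllm-openai image and its default port 8000 (on the Spark's ARM/Blackwell platform you may need NVIDIA's own container build instead):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model nvidia/Nemotron-Nano-VL-12B-V2-NVFP4-QAD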

To build and run llama.cpp from source, first install the development tools:

sudo apt install clang cmake libcurl4-openssl-dev
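
Before building, it's worth confirming the CUDA toolkit is visible (it ships with DGX OS on the Spark; if nvcc is missing, install it first):

nvcc --version
nvidia-smi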

Check out llama.cpp:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Build:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 20
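
As a quick sanity check that the CUDA backend was actually compiled in, list the devices the build can see (if your build predates the --list-devices flag, loading any model with -ngl set will also log which device the layers land on):

build/bin/llama-server --list-devices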

Serve gpt-oss-120b:

build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF \
-fa on -ngl 999 \
--jinja \
--ctx-size 0 \
-b 2048 -ub 2048 \
--no-mmap \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--reasoning-format auto \
--chat-template-kwargs "{\"reasoning_effort\": \"medium\"}"
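
The important flags: -ngl 999 offloads every layer to the GPU, --ctx-size 0 takes the context length from the model's metadata, and --no-mmap loads the weights fully into the Spark's unified memory. Once it's up, the server speaks the OpenAI-compatible API on port 8080 by default:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'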

Expected performance numbers with 1 or 2 Sparks: