DGX Spark is extremely slow on a short LLM test

llama.cpp comes with a built-in chat interface that is quite capable.

As for Open WebUI, you can add a new connection for llama.cpp, since its server is OpenAI API compatible.

Depending on the port you choose for llama.cpp (llama-server defaults to 8080; the example below uses 8000), you just need to add that port to the URL. When Open WebUI is also running inside a container, you will need to replace localhost with the IP address of the Spark's primary network interface, because localhost inside the container refers to the container itself, not the host.

So http://192.168.0.123:8000/v1 (for example) would be a URL you could enter for the connection.
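
To check that the address is right before wiring it into Open WebUI, you can hit the server's model listing endpoint directly. The IP and port below are just the example values from above; substitute your own (on the Spark, `ip -4 addr` or `hostname -I` will show the interface addresses):

```
# Example values only -- adjust host and port to your setup.
curl http://192.168.0.123:8000/v1/models
```

If llama-server is up, this returns a small JSON document listing the loaded model; a connection refusal usually means the wrong IP/port or a server bound only to localhost.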

See docs/docker.md in the ggml-org/llama.cpp repository on GitHub for using the Docker images. For example:

```
docker run --gpus all -p 8000:8000 -v $HOME/models:/models \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  -s -hf ggml-org/gpt-oss-120b-GGUF --port 8000 --host 0.0.0.0 -c 0 --jinja
```
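
If you already have a GGUF file in $HOME/models, the same image can load it directly with -m (the model path) instead of downloading via -hf. A sketch, with a placeholder file name you would replace with your own:

```
docker run --gpus all -p 8000:8000 -v $HOME/models:/models \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  -s -m /models/your-model.gguf --port 8000 --host 0.0.0.0 -c 0 --jinja
```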

-hf downloads a model directly from Hugging Face – in this case ggml-org/gpt-oss-120b-GGUF.
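
Once the server is running, any OpenAI-compatible client can talk to it – Open WebUI is just one option. A minimal sketch using only Python's standard library; the base URL and model name are the example values from this post, not defaults:

```python
import json
from urllib import request

def build_chat_body(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, prompt: str, model: str = "gpt-oss-120b") -> str:
    """POST one request to the server's /v1 chat completions
    endpoint and return the assistant's reply text."""
    data = json.dumps(build_chat_body(model, prompt)).encode()
    req = request.Request(
        f"{base_url}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Needs a running llama-server, e.g.:
# print(chat("http://192.168.0.123:8000/v1", "Hello!"))
```

The same endpoint is what Open WebUI uses under the hood, so if this works from the command line, the connection settings in Open WebUI should work too.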

There are also helpers that ease the use of llama.cpp, like llama-swap – I'm not sure whether ready-to-use arm64 images exist for that yet.

Still waiting for ASUS Germany to deliver… so I can't test it myself yet.