Building llama.cpp container images for Spark/GB10

You should see this behavior on every GPU: utilization above 90%. Whenever you fire a request, GPU usage climbs to near 100% for as long as the request is being processed, then drops back to zero once it finishes - assuming a single user sending requests one at a time. A GPU shared by multiple users may not drop to zero for a long(er) while. 😅
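If you want to log this spike-and-drop pattern rather than just watch it, a small poller works. This is a sketch, not part of the original setup: it assumes the standard `nvidia-smi` CLI (shipped with the NVIDIA driver) and falls back gracefully on machines without it; the function name `poll_gpu_utilization` is mine.

```python
import shutil
import subprocess

def poll_gpu_utilization():
    """Return the current GPU utilization percentages, or None without a driver.

    Uses `nvidia-smi --query-gpu=utilization.gpu` in CSV mode, which prints
    one integer percentage per GPU.
    """
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.split() if line.strip()]

if __name__ == "__main__":
    print(poll_gpu_utilization())
```

Call it once a second while a request is in flight and you should see the values climb toward 100 and fall back to 0 afterwards; nvtop [1] shows the same data with a nicer TUI.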

If you install nvtop (which needs a patched version for Spark [1]), you will see something like the screenshots below: one while a request is running, and one after it has finished.

GPUs are designed for massive parallelism: their thousands of cores are meant to be used all at once, which is what makes them so fast compared to CPUs. A task is split into many smaller pieces that can be worked on in parallel.
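The split-and-combine idea can be sketched on the CPU (a toy analogy of my own, not GPU code - a GPU does this with thousands of hardware threads instead of a handful of software ones):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Split one big task into smaller pieces and combine the partial results."""
    data = list(data)
    chunk = max(1, len(data) // workers)
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each piece is summed independently; the partial sums are then merged.
        partials = pool.map(sum, pieces)
    return sum(partials)

print(parallel_sum(range(1_000)))  # same answer as sum(range(1_000)): 499500
```

The result is identical to the sequential sum; the point is only that the work decomposes into independent pieces, which is exactly the property GPU kernels exploit.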

[1] NVTOP with DGX Spark unified memory support