DGX Spark txt2kg playbook discrepancies / CPU fallback questions

Disclaimer: I definitely fall into the novice category; issues here may qualify as defective user…

I have been unsuccessful in running the txt2kg tool on my Spark GPU. As I went through the playbook, here are the various challenges/anomalies I hit:

Step 1: Clone the repository

  • There is a typo in the clone path: it is missing the s at the end of dgx-spark-playbooks/… The corrected command is sketched below.
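
For anyone following along, a sketch of the corrected clone command (the repo name follows from the missing-s note above; the NVIDIA GitHub org is my assumption):

# Assumed URL: NVIDIA org on GitHub; repo name per the missing-s note above
$ git clone https://github.com/NVIDIA/dgx-spark-playbooks.git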

Step 2: Start the txt2kg services
The system fails to allocate Ollama to the GPU and instead falls back to CPU inference:
ollama-compose | … msg="inference compute" id=cpu library=cpu compute="" … total="119.7 GiB" available="115.8 GiB"
ollama-compose | … msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
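
For anyone hitting the same wall, two checks that narrow this down (a sketch, assuming the service name ollama-compose from the logs above, and that the NVIDIA container toolkit injects nvidia-smi into the container):

# Should list the GPU if the container can see it at all
# (assumes the NVIDIA container toolkit mounts nvidia-smi in)
$ docker exec ollama-compose nvidia-smi

# The "inference compute" log line shows which backend Ollama picked
$ docker logs ollama-compose 2>&1 | grep -i "inference compute"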

I dug around in services and found clear_cache_and_restart.sh in the ../deploy/services/ollama folder. As written, it didn’t have a happy path and promptly shook its fist at me. However, after correcting the path and trying again, I got the same error.

Step 5: Upload documents and build knowledge graphs
Here, I was able to connect to the service and upload a file. However, with CPU inference I realized it was just going to take too long, so I tried the other NVIDIA-hosted models listed in the model pull-down.

First time through, it complained that I didn't have a key:
app-1 | … Error: NVIDIA API key is required when using NVIDIA provider. Please set NVIDIA_API_KEY in your environment variables.
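
A sketch of supplying the key, assuming the compose stack reads NVIDIA_API_KEY from the shell environment or a .env file next to docker-compose.yml (that mechanism is my assumption, not something stated in the playbook):

# Key from build.nvidia.com; the nvapi- prefix is the usual format.
# Assumes start.sh passes the shell environment through to the app container.
$ export NVIDIA_API_KEY=nvapi-xxxxxxxxxxxx
$ ./start.sh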

So, I decided to go ahead and use the NVIDIA_API_KEY I had set up when I tried the RAG application in the AI Workbench demo. Alas, the selected model was not found:
app-1 | Error creating or testing Nemotron model: Error: Model test failed: 404 status code (no body)
app-1 |
app-1 | Troubleshooting URL: MODEL_NOT_FOUND | 🦜️🔗 Langchain
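
For completeness, one way to check which models a key can actually reach (this assumes the NVIDIA provider talks to the OpenAI-compatible endpoint at integrate.api.nvidia.com; that endpoint is my guess, not something from the playbook):

# List the models the key has access to; the selected model's id must appear here
$ curl -s https://integrate.api.nvidia.com/v1/models \
    -H "Authorization: Bearer $NVIDIA_API_KEY" | grep '"id"'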

At this point, the wine glass was empty and I elected to call it a night.

Questions:

  1. Has anyone successfully used this playbook to run on the GPU?
  2. Is there something straightforward I should be trying? I was hoping to explore txt2kg a bit to see whether it would be useful to me, but I'm not sitting on a mission-critical need, so if I have to wait for updates, so be it.

Thanks!

2 Likes

Hi, this is a known issue with the Text 2 Graph playbook, which we will fix.

1 Like

Thanks, aniculescu. Will keep an eye out for the update.

1 Like

FIX: Change OLLAMA_LLM_LIBRARY from cuda to cuda_v13.

I had the same issue, but testing the Ollama image by itself shows it's not the image at fault: run standalone, it is able to use the GPU.

# Run Ollama in Docker by itself
$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Test
$ docker exec ollama ollama run llama3.1:8b "test" && docker exec ollama ollama ps

NAME           ID              SIZE      PROCESSOR    CONTEXT    UNTIL               
llama3.1:8b    46e0c10c039e    5.2 GB    100% GPU     4096       29 minutes from now  

# Locate the CUDA libraries. The directory names below are the valid values for the OLLAMA_LLM_LIBRARY env var.
$ docker exec -it ollama bash
root:/# ls -l /usr/lib/ollama/
total 1568
drwxr-xr-x 2 root root   4096 Nov 13 22:01 cuda_jetpack5
drwxr-xr-x 2 root root   4096 Nov 13 21:59 cuda_jetpack6
drwxr-xr-x 2 root root   4096 Nov 13 22:12 cuda_v12
drwxr-xr-x 2 root root   4096 Nov 13 22:09 cuda_v13
-rwxr-xr-x 1 root root 857808 Nov 13 21:55 libggml-base.so
-rwxr-xr-x 1 root root 725928 Nov 13 21:55 libggml-cpu.so

So I changed OLLAMA_LLM_LIBRARY from cuda to cuda_v13.

# FIX: Change line 61 in docker-compose.yml
    environment:
      - OLLAMA_LLM_LIBRARY=cuda_v13       # use the cuda_v13 backend instead of the bare "cuda"

$ ./start.sh

# Test
$ docker exec ollama-compose ollama run llama3.1:8b "test" && docker exec ollama-compose ollama ps

NAME           ID              SIZE      PROCESSOR    CONTEXT    UNTIL               
llama3.1:8b    xxxxxxxxxxxxx   5.2 GB    100% GPU     4096       xx minutes from now  
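
To confirm the variable actually landed in the running container, a quick sanity check (ollama-compose is the service name used above):

$ docker exec ollama-compose env | grep OLLAMA_LLM_LIBRARY
# expect: OLLAMA_LLM_LIBRARY=cuda_v13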

Longer answer

OLLAMA_LLM_LIBRARY is declared as an env-config key and mentioned in the docs, but the dynamic loader that actually picks/loads runtime backends is driven by the ggml dynamic-backend loader and OLLAMA_LIBRARY_PATH (not by OLLAMA_LLM_LIBRARY alone). In other words, setting OLLAMA_LLM_LIBRARY=cuda by itself is not sufficient if the dynamic CUDA backend library is not present/compatible or if OLLAMA_LIBRARY_PATH / LD_LIBRARY_PATH / container GPU access is incorrect — in those cases the code will fall back to the CPU backend and you’ll see ~100% CPU usage.
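
If you want to see what the loader is actually working with, inspecting those variables inside the container is a reasonable first probe (a sketch; the exact set of variables Ollama honors can vary by version):

$ docker exec ollama-compose env | grep -E "OLLAMA_LLM_LIBRARY|OLLAMA_LIBRARY_PATH|LD_LIBRARY_PATH"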

What to check (quick checklist — run on the machine where you see 100% CPU)

  • Check which LLM libraries are present: list /usr/lib/ollama (or the lib/ollama directory next to the ollama binary) and confirm the cuda_v13 / cuda_v12 backend directories and the libggml-*.so files are there; see the sketch below.
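
The same check as commands (the docker exec form matches the compose setup above; the readlink form is for a native install where ollama is on the PATH):

# Compose container (service name from the fix above):
$ docker exec ollama-compose ls /usr/lib/ollama

# Native install, resolving from the ollama binary on the PATH:
$ ls "$(dirname "$(readlink -f "$(which ollama)")")/../lib/ollama"
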
5 Likes

Good Lord, 4 characters made all the difference. I'd been tearing my hair out all day yesterday over this.

Two days ago, I watched an NVIDIA live stream, "DGX Spark Live: Process Text for GraphRAG With Up to 120B LLM", where NVIDIA employees Rishi Puri, Santosh Pavani and Prachi Goel demonstrated how to use this very repository. They specifically mention that they're using gpt-oss-120b as the underlying model, served by Ollama.

Although they show results from the pipeline at various points, they don't show the LLM in action; if it were running only on the CPU, that big model would crawl. So what did they do to make it work? Their presentation doesn't mention any tweaks; they are just using the code from the repository.

As I said at the beginning, changing the environment variable OLLAMA_LLM_LIBRARY from cuda to cuda_v13 resulted in a big speed-up in token generation because it allowed the use of the GPU. How could the presenters not have known that this was necessary?

3 Likes

Thanks, Neurfer! Been traveling and didn’t get back to my system until today. I will definitely make the change - really appreciate the update!

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.