txt2kg Knowledge Graph Triple Extraction is slow (more than 10 minutes) using the existing system prompt and LLM model, Ollama Qwen3 1.7B. My sample text is tiny, only 130 kB. Any idea why?
Could you add some more details of what you’ve set up and how it is supposed to work?
Thanks.
Git clone and then run ./start.sh; see below. Thanks, mate.
./start.sh
Checking for GPU support…
✓ NVIDIA GPU detected
GPU: NVIDIA GB10, [N/A]
Using Docker Compose V2
Checking Docker permissions…
✓ Docker permissions OK
Using ArangoDB + Ollama configuration…
Starting services…
Running: docker compose -f /home/brianho/project/dgx-spark-playbooks/nvidia/txt2kg/assets/deploy/compose/docker-compose.yml up -d
[+] Running 8/8
✔ Network compose_txt2kg-network Created 0.0s
✔ Network qdrant-network Created 0.0s
✔ Network compose_default Created 0.0s
✔ Container ollama-compose Started 0.3s
✔ Container compose-arangodb-1 Started 0.2s
✔ Container compose-arangodb-init-1 Started 0.3s
✔ Container compose-backend-1 Started 0.3s
✔ Container compose-app-1 Started 0.4s
==========================================
txt2kg is now running!
Core Services:
• Web UI: http://localhost:3001
• ArangoDB: http://localhost:8529
• Ollama API: http://localhost:11434
Next steps:
- Pull an Ollama model (if not already done):
  docker exec ollama-compose ollama pull llama3.1:8b
- Open http://localhost:3001 in your browser
- Upload documents and start building your knowledge graph!
Other options:
• Stop services: ./stop.sh
• Run frontend in dev mode: ./start.sh --dev-frontend
• Use vLLM (GPU): ./start.sh --vllm
• Add vector search: ./start.sh --vector-search
• View logs: docker compose logs -f
Viewing the Docker logs while the text is processing may help you see if something is slowing down the process.
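For example, you can tail the two relevant containers side by side (container names taken from your start.sh output above; adjust if yours differ):

```shell
# Run each in its own terminal window.
# Backend log: shows chunking, prompting, and per-request timings.
docker logs -f compose-backend-1

# Ollama log: shows model load, GPU offload, and inference timings.
docker logs -f ollama-compose
```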
Also, you can view the “Troubleshooting” tab in the playbook and try out those suggestions.
For example, you could try setting these environment variables to improve performance:
OLLAMA_FLASH_ATTENTION=1 (enables flash attention for better performance)
OLLAMA_KEEP_ALIVE=30m (keeps model loaded for 30 minutes)
OLLAMA_MAX_LOADED_MODELS=1 (avoids VRAM contention)
OLLAMA_KV_CACHE_TYPE=q8_0 (reduces KV cache VRAM with minimal performance impact)
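A quick way to confirm those variables actually made it into the container (a sketch; assumes the container is named ollama-compose, as in your start.sh output):

```shell
# Print the OLLAMA_* tuning variables the server process actually sees
docker exec ollama-compose env | grep '^OLLAMA_'
# You should see lines like OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0
```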
Thanks, Ani.
Those env params have already been set in the docker-compose.yml; see below. Please confirm this is indeed correct.
  # Ollama - Local LLM inference
  ollama:
    build:
      context: ../services/ollama
      dockerfile: Dockerfile
    image: ollama-custom:latest
    container_name: ollama-compose
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_GPU_LAYERS=-1
      - OLLAMA_LLM_LIBRARY=cuda_v13
    networks:
      - default
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
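While a job is running, I can also check whether the model is actually executing on the GPU rather than the CPU (a sketch using the standard ollama ps command; any CPU percentage in the PROCESSOR column would mean partial offload and a big slowdown):

```shell
# Show loaded models and where they run; PROCESSOR should read "100% GPU"
docker exec ollama-compose ollama ps

# Watch GPU utilization on the host while extraction is in progress
nvidia-smi
```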
I will look into the Docker logs and see if I can spot something.
