Hi,
I am running into a reproducible issue with the NVIDIA Embedding NIM on a DGX Spark system and would like to clarify whether this is a known compatibility problem with GB10 / ARM64 or a misconfiguration on my side.
Environment

- Hardware: NVIDIA DGX Spark with GB10 (Grace Blackwell), 128 GB unified memory
- CPU arch: ARM64 / aarch64
- OS: DGX Base OS (Ubuntu 22.04 based)
- Driver / CUDA (from inside a CUDA container), as reported by nvidia-smi:
  - Driver Version: 580.95.05
  - CUDA Version: 13.0
- Container runtime: Docker with NVIDIA Container Toolkit
- Other NIMs on the same system are working:
  - LLM NIM: nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:1.13.1 (running with a valid NIM_MODEL_PROFILE, GPU 0)
  - Ranking NIM: nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.8.0 (healthy)
  - NeMo Retriever page/graphic/table NIMs are also running fine
- Vector DB:
  - Milvus: milvusdb/milvus:v2.6.2-gpu (up and healthy)
  - MinIO + etcd running without issues
So GPU access, NIM base stack, and Milvus are all functioning on DGX Spark.
Embedding NIM in use
Embedding service from the NVIDIA RAG blueprint:
- Image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.0
- Exposed on host as: http://localhost:9080/v1/embeddings (reachable; see the health-check sketch below)
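Before sending embedding requests, I verify the container is up; a minimal sketch, assuming this NIM exposes the standard /v1/health/ready and /v1/models endpoints on the same port:

```python
import requests

BASE_URL = "http://localhost:9080"

# Readiness probe (assumption: this embedding NIM exposes the standard
# NIM endpoint on the same port as /v1/embeddings).
ready = requests.get(f"{BASE_URL}/v1/health/ready", timeout=10)
print("ready:", ready.status_code, ready.text)

# List the models the container serves, to confirm the model name in the
# embeddings payload matches what the NIM reports.
models = requests.get(f"{BASE_URL}/v1/models", timeout=10)
print("models:", models.status_code, models.text)
```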
Request that fails
I am calling the Embedding NIM via Python (requests) from the DGX host (ARM64 venv):
```python
import requests

EMBEDDING_URL = "http://localhost:9080/v1/embeddings"
EMBEDDING_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"

payload = {
    "model": EMBEDDING_MODEL,
    "input": ["Dies ist ein kurzer Test für den Embedding-NIM."],
    "input_type": "passage",  # as required: 'query' or 'passage'
}

resp = requests.post(EMBEDDING_URL, json=payload, timeout=60)
print(resp.status_code)
print(resp.text)
```
The payload is accepted syntactically (the 4xx validation errors disappear once input_type is set to "passage" as required), but the service then returns an internal error.
Response
HTTP status: 500
Body:

```json
{
  "object": "error",
  "message": "Something went wrong with the request.",
  "detail": "Unexpected error: onnx runtime error 1: Non-zero status code returned while running ReduceSum node. Name:'/pooling_module/ReduceSum_1' Status Message: CUDA error cudaErrorSymbolNotFound:named symbol not found",
  "type": "internal_server_error"
}
```
So the request is valid, but the ONNX Runtime inside the NIM container fails with:

cudaErrorSymbolNotFound: named symbol not found

This happens consistently for any non-trivial input; the container itself is running and reachable, but fails on actual inference.
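For completeness, this is roughly how I check that the failure is not specific to one payload; a minimal sketch with a hypothetical embed() helper (not part of the blueprint):

```python
import requests

EMBEDDING_URL = "http://localhost:9080/v1/embeddings"
EMBEDDING_MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"

def embed(texts, input_type):
    """Send one embeddings request and return (status_code, body)."""
    payload = {"model": EMBEDDING_MODEL, "input": texts, "input_type": input_type}
    resp = requests.post(EMBEDDING_URL, json=payload, timeout=60)
    return resp.status_code, resp.text

# Vary the input text and the input_type to confirm the 500 is not
# payload-specific before blaming the model build.
for input_type in ("query", "passage"):
    for text in ("short test", "a somewhat longer sentence for the embedding model"):
        status, body = embed([text], input_type)
        print(input_type, status, body[:120])
```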
What already works on the same system

- Running nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 with nvidia-smi works fine and shows the GB10 GPU.
- The LLM NIM (llama-3.3-nemotron-super-49b-v1.5:1.13.1) is able to load and run with a compatible model profile on GPU 0.
- The Ranking NIM works and responds correctly (see the sketch below for the check I run).

So the GPU + CUDA stack and the other NIMs are OK; the issue seems specific to this embedding model / build.
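For reference, this is the kind of request the ranking NIM answers without issues; a minimal sketch assuming the documented /v1/ranking endpoint, with a placeholder host port (9081 here; substitute the actual port mapping on your system):

```python
import requests

# Assumption: host port 9081 is a placeholder for wherever the rerank
# container is actually mapped.
RERANK_URL = "http://localhost:9081/v1/ranking"

payload = {
    "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "query": {"text": "What is DGX Spark?"},
    "passages": [{"text": "DGX Spark is a compact Grace Blackwell system."}],
}

resp = requests.post(RERANK_URL, json=payload, timeout=60)
print(resp.status_code)
print(resp.text)  # a 'rankings' list comes back as expected
```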
Questions

- Is nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.0 officially tested/supported on DGX Spark (GB10, ARM64, driver 580.xx / CUDA 13)?
- Is this cudaErrorSymbolNotFound a known issue with this particular embedding NIM build on GB10, and is there a recommended tag (e.g. a newer production branch) that should be used instead on DGX Spark?
- If a newer / multi-arch build is already available or planned for this model (or an equivalent embedding model), could you point me to the recommended image/tag for DGX Spark?
Goal: I would like to use this embedding NIM (or an equivalent one) as the document/query encoder in an on-prem RAG setup on DGX Spark, where the LLM NIM and Milvus are already working.
Thanks in advance for any guidance or pointers.