I’ve been working with Nemotron Nano VL 12B on the DGX Spark and thought I’d try out some alternative options. Here’s a quick comparison. (Edit: see details on the 34B float16 in a reply.)
| Feature | LLaVA-NeXT-Mistral | Nemotron Nano VL 12B |
|---|---|---|
| Active Params | ~13B | 12B |
| Total Params | ~47B | 12B |
| Intelligence | Deep reasoning. Better at “thinking” through complex images, analyzing charts, and broad world knowledge. | Task optimized. Specialized for OCR, reading documents, and edge use cases. |
| Latency | Medium. It has to route tokens between experts. | Ultra-low. It is a straight shot through a smaller network. |
| Architecture | Mixture of Experts (MoE) | Dense Model |
| On DGX Spark | Perfect fit. Uses the massive 128 GB of memory to store the experts, but runs fast because only the active experts fire per token. | Overkill. The DGX Spark is so powerful it could run 5–6 instances of this model simultaneously. |
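The memory fit in the last row is easy to sanity-check with back-of-envelope math (assuming fp16 weights at 2 bytes per parameter, ignoring KV cache and activations):

```python
# Rough fp16 footprint: 2 bytes per parameter, weights only.
# KV cache and activations add more on top of this.
def fp16_weight_gb(params_billion: float) -> float:
    return params_billion * 2  # ~2 GB per billion parameters

print(fp16_weight_gb(47))  # 94 -> ~94 GB, fits in the 128 GB unified memory
print(fp16_weight_gb(12))  # 24 -> ~24 GB, leaving room for several instances
```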
Here is the Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/pytorch:25.10-py3

ENV DEBIAN_FRONTEND=noninteractive

# 1. System deps
RUN apt-get update && apt-get install -y \
    libgl1 \
    libglib2.0-0 \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# 2. Python setup
RUN pip install --upgrade pip
RUN pip uninstall -y torchao || true

# 3. Standard Python deps
# (quote the ">=" specifier so the shell doesn't treat it as a redirect)
RUN pip install \
    transformers==4.46.3 \
    "accelerate>=1.0.0" \
    bitsandbytes \
    pillow \
    requests \
    protobuf \
    scipy \
    fastapi \
    uvicorn \
    python-multipart

# 4. OPTIMIZED FLASH ATTENTION BUILD
# MAX_JOBS=4 prevents an out-of-memory crash during compilation.
# TORCH_CUDA_ARCH_LIST="10.0" tells it to ONLY build for Blackwell (GB10).
# This skips building for Volta (V100), Ampere (A100), and Hopper (H100).
ENV MAX_JOBS=4
ENV TORCH_CUDA_ARCH_LIST="10.0"
RUN pip install flash-attn --no-build-isolation --upgrade --no-cache-dir

WORKDIR /app
```
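Getting the arch pin wrong means flash-attn silently lacks kernels for your GPU. A small hypothetical helper (names are mine, not a flash-attn API) illustrates the check: the capability tuple is what `torch.cuda.get_device_capability()` returns on real hardware.

```python
def arch_matches(arch_list: str, capability: tuple) -> bool:
    # arch_list is the TORCH_CUDA_ARCH_LIST value, e.g. "10.0" or "9.0;10.0".
    # capability mirrors torch.cuda.get_device_capability(), e.g. (10, 0).
    archs = {a.strip() for a in arch_list.split(";")}
    return f"{capability[0]}.{capability[1]}" in archs

print(arch_matches("10.0", (10, 0)))  # True on GB10 (Blackwell)
print(arch_matches("10.0", (9, 0)))   # False on H100 -> kernels missing, rebuild needed
```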
Build it:

```bash
docker build -t llava-next-mistral:v1 .
```
Create a `server.py` file:

```python
import io

import torch
import uvicorn
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import JSONResponse
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

app = FastAPI()

# --- Load Model ---
print("Initializing LLaVA-NeXT-Mistral on DGX GB10...")

# Hardcode device to cuda since Flash Attention requires it
device = "cuda"
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load processor
processor = LlavaNextProcessor.from_pretrained(model_id)

# Load model
# FIX: 'attn_implementation' and 'device_map' load it directly onto the GPU
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
print(f"Model loaded successfully on {device}!")


@app.post("/generate")
async def generate(image: UploadFile = File(...), prompt: str = Form(...)):
    try:
        # Read image (convert to RGB so PNGs with alpha channels don't fail)
        image_content = await image.read()
        pil_image = Image.open(io.BytesIO(image_content)).convert("RGB")

        # Format prompt (Mistral instruction template with an image slot)
        full_prompt = f"[INST] <image>\n{prompt} [/INST]"

        # Inference
        # The model is already on the GPU; only the inputs need moving.
        inputs = processor(
            images=pil_image, text=full_prompt, return_tensors="pt"
        ).to(device)
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.2,
        )

        # Decode
        generated_text = processor.decode(output[0], skip_special_tokens=True)

        # Clean response: keep only the text after the final [/INST]
        response_only = generated_text.split("[/INST]")[-1].strip()
        return {"response": response_only}
    except Exception as e:
        return JSONResponse(content={"error": str(e)}, status_code=500)


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
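The prompt wrapping and the `[/INST]` cleanup in the server are pure string operations, so they can be sanity-checked without loading the model. A minimal sketch (helper names are mine):

```python
def build_prompt(user_prompt: str) -> str:
    # Mistral instruction template used by llava-v1.6-mistral-7b-hf
    return f"[INST] <image>\n{user_prompt} [/INST]"

def extract_response(decoded: str) -> str:
    # Mirrors the server's cleanup: keep only text after the final [/INST]
    return decoded.split("[/INST]")[-1].strip()

# The decoded generation echoes the prompt, then appends the answer
decoded = build_prompt("Describe this image.") + " A cute llama logo."
print(extract_response(decoded))  # A cute llama logo.
```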
Run the docker container (note the tag matches the `v1` we built above):

```bash
docker run -it --rm \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $(pwd)/scripts:/app/scripts \
  -v $(pwd)/models:/root/.cache/huggingface \
  llava-next-mistral:v1 \
  python3 /app/scripts/server.py
```
Test it:

```bash
# Download a dummy image to test if you don't have one
wget https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png -O test_image.png

# Send the request (let curl set the multipart Content-Type itself,
# since it needs to include the boundary)
curl -X POST "http://localhost:8000/generate" \
  -H "accept: application/json" \
  -F "prompt=Describe this image." \
  -F "image=@test_image.png"
```
Use LLaVA-NeXT-Mistral if:

- You need the model to explain *why* something is happening in an image.
- You are analyzing complex scenes with many objects.
- You are doing general-purpose chat (e.g., asking it about history, coding, or art based on an image).