LLaVA-Mistral multimodal (7B & 34B)

I’ve been working with Nemotron Nano VL 12B on the DGX Spark and thought I’d try out some alternative options. Here’s a quick comparison. (Edit: see details on the 34B float16 in the reply.)

| Feature | LLaVA-NeXT-Mistral | Nemotron Nano VL 12B |
| --- | --- | --- |
| Active params | ~13B | 12B |
| Total params | ~47B | 12B |
| Intelligence | Deep reasoning. Better at “thinking” through complex images, analyzing charts, and broad world knowledge. | Task-optimized. Specialized for OCR, reading documents, and edge use cases. |
| Latency | Medium. It has to route tokens between experts. | Ultra-low. It is a straight shot through a smaller network. |
| Architecture | Mixture of Experts (Mixtral 8x7B) | Dense model |
| On DGX Spark | Perfect fit. Uses the massive 128 GB of memory to store all the experts, but runs fast because only the active experts are used per token. | Overkill. The DGX Spark could run 5–6 instances of this model simultaneously. |
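As a rough sanity check on the “perfect fit” claim, here is a back-of-the-envelope memory estimate (a sketch only: the ~47B total / ~13B active figures correspond to the Mixtral-8x7B-based variant, and real usage adds KV cache and activation overhead on top of the weights):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# MoE: all ~47B weights must sit in memory, even though only ~13B are active per token.
moe_fp16 = weight_memory_gb(47, 2)   # float16 = 2 bytes/param
nano_fp16 = weight_memory_gb(12, 2)  # Nemotron Nano VL 12B

print(f"MoE fp16 weights:  ~{moe_fp16:.0f} GB (fits in the Spark's 128 GB unified memory)")
print(f"Nano fp16 weights: ~{nano_fp16:.0f} GB ({128 // nano_fp16:.0f} instances fit by weights alone)")
```

The point the table is making: an MoE buys the latency of a ~13B forward pass while paying the memory cost of ~47B weights, which is exactly the trade-off the Spark’s large unified memory absorbs.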

Here is the Dockerfile

FROM nvcr.io/nvidia/pytorch:25.10-py3

ENV DEBIAN_FRONTEND=noninteractive

# 1. System Deps
RUN apt-get update && apt-get install -y \
    libgl1 \
    libglib2.0-0 \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# 2. Python Setup
RUN pip install --upgrade pip
RUN pip uninstall -y torchao || true

# 3. Install Standard Python Deps
RUN pip install \
    transformers==4.46.3 \
    "accelerate>=1.0.0" \
    bitsandbytes \
    pillow \
    requests \
    protobuf \
    scipy \
    fastapi \
    uvicorn \
    python-multipart

# 4. OPTIMIZED FLASH ATTENTION BUILD
# MAX_JOBS=4 prevents memory crash.
# TORCH_CUDA_ARCH_LIST="10.0" tells it to ONLY build for Blackwell (GB10).
# This skips building for Volta (V100), Ampere (A100), and Hopper (H100).
ENV MAX_JOBS=4
ENV TORCH_CUDA_ARCH_LIST="10.0"
RUN pip install flash-attn --no-build-isolation --upgrade --no-cache-dir

WORKDIR /app

Build it

docker build -t llava-next-mistral:v1 .

Create a server.py file

import uvicorn
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import io
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

app = FastAPI()

# --- Load Model ---
print("Initializing LLaVA-NeXT-Mistral on DGX GB10...")
# Hardcode device to cuda since Flash Attention requires it
device = "cuda"
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load Processor
processor = LlavaNextProcessor.from_pretrained(model_id)

# Load Model
# FIX: We use 'attn_implementation' and 'device_map' to load directly to GPU
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

print(f"Model loaded successfully on {device}!")

@app.post("/generate")
async def generate(image: UploadFile = File(...), prompt: str = Form(...)):
    try:
        # Read image
        image_content = await image.read()
        pil_image = Image.open(io.BytesIO(image_content))

        # Format Prompt
        full_prompt = f"[INST] <image>\n{prompt} [/INST]"

        # Inference
        # Note: We don't need .to(device) here for the model, it's already there.
        # But inputs must be moved.
        inputs = processor(images=pil_image, text=full_prompt, return_tensors="pt").to(device)
        
        output = model.generate(
            **inputs, 
            max_new_tokens=200,
            do_sample=True,
            temperature=0.2
        )

        # Decode
        generated_text = processor.decode(output[0], skip_special_tokens=True)
        
        # Clean response
        response_only = generated_text.split("[/INST]")[-1].strip()

        return {"response": response_only}

    except Exception as e:
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the docker container

docker run -it --rm \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -v $(pwd)/scripts:/app/scripts \
    -v $(pwd)/models:/root/.cache/huggingface \
    llava-next-mistral:v1 \
    python3 /app/scripts/server.py

Test it

# Download a dummy image to test if you don't have one
wget https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png -O test_image.png

# Send the request
curl -X POST "http://localhost:8000/generate" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "prompt=Describe this image." \
     -F "image=@test_image.png"
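If you’d rather call the endpoint from Python than curl, here is a minimal stdlib-only client sketch (no `requests` dependency; the `prompt` and `image` field names match the FastAPI handler above):

```python
import io
import uuid

def build_multipart(prompt: str, image_bytes: bytes, filename: str = "test_image.png"):
    """Build a multipart/form-data body for the /generate endpoint."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def part(headers: str, payload: bytes):
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload)
        body.write(b"\r\n")

    part('Content-Disposition: form-data; name="prompt"', prompt.encode())
    part(
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
        "Content-Type: image/png",
        image_bytes,
    )
    body.write(f"--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

# Usage against a running server (uncomment with server.py up on localhost:8000):
# import urllib.request
# data, ctype = build_multipart("Describe this image.", open("test_image.png", "rb").read())
# req = urllib.request.Request("http://localhost:8000/generate", data=data,
#                              headers={"Content-Type": ctype})
# print(urllib.request.urlopen(req).read().decode())
```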

Use LLaVA-NeXT-Mistral if:

  • You need the model to explain why something is happening in an image.

  • You are analyzing complex scenes with many objects.

  • You are doing “General Purpose” chat (e.g., asking it about history, coding, or art based on an image).

I’ll post the config for the llava-v1.6-34b-hf model next, and then also the MoE version based on Mixtral 8x7B.

I’ve created a server for running the 34B model in float16 with streaming. You’ll need to install this first:

pip install sentencepiece

And then the server.py for the larger model:

import uvicorn
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import io
import threading
from transformers import (
    LlavaNextProcessor, 
    LlavaNextForConditionalGeneration, 
    TextIteratorStreamer
)

app = FastAPI()

# --- CONFIGURATION ---
MODEL_ID = "llava-hf/llava-v1.6-34b-hf"

print(f"Initializing {MODEL_ID} in Float16 (High Accuracy)...")

# 1. Load Processor
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)

# 2. Load Model
# FIX: device_map="cuda" forces initialization directly on the GPU.
# This satisfies Flash Attention 2 requirements.
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="cuda"  
)

print("--- Model Loaded & Ready ---")

@app.post("/generate")
async def generate(image: UploadFile = File(...), prompt: str = Form(...)):
    try:
        # Process Image
        content = await image.read()
        pil_image = Image.open(io.BytesIO(content))

        # Yi-34B Prompt Template
        full_prompt = f"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\n{prompt}<|im_end|><|im_start|>assistant\n"

        # Explicit inputs moved to CUDA
        inputs = processor(text=full_prompt, images=pil_image, return_tensors="pt").to("cuda")

        # Setup Streamer
        streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
        
        # Generation Arguments
        generation_kwargs = dict(
            inputs, 
            streamer=streamer, 
            max_new_tokens=512, 
            do_sample=True, 
            temperature=0.2, 
            top_p=0.9
        )

        # Run generation in a background thread
        thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        # Collect the stream
        def iter_response():
            accumulated_text = ""
            print("\nGenerating: ", end="") # Visual marker in server logs
            for new_text in streamer:
                # Print to server console for immediate feedback
                print(new_text, end="", flush=True) 
                accumulated_text += new_text
            return accumulated_text

        # Wait for completion and return full string
        final_response = iter_response()
        print("\n--- Done ---")
        
        # Clean up the output
        clean_response = final_response.split("assistant\n")[-1].strip()
        
        return {"response": clean_response}

    except Exception as e:
        print(f"Error: {e}")
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Why I like/prefer the 34B model:

Unlike the 7B or 13B models, Yi-34B has a much deeper “world understanding.”

Complex Instruction Following: If you ask it to “Extract all the text in the red box and format it as JSON,” it will likely succeed where the 7B model would hallucinate or just give raw text.

OCR (Optical Character Recognition): It is excellent at reading dense text, charts, and documents compared to smaller models.

Nuance: It can explain why something is funny or unusual in an image, rather than just listing objects.

LLaVA-NeXT uses a technique called AnyRes, which allows it to see high-resolution images.

Old LLaVA (v1.5): Resized everything to a blurry 336x336 square.

This Model (v1.6): Dynamically slices the image into patches. It can see up to 4x more pixels than v1.5. This means it can spot small text or details in the corner of a large diagram.
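The pixel-budget difference is easy to quantify. A rough sketch (the exact grid shapes actually come from the model’s `image_grid_pinpoints` config, so treat the 672x672 case as an illustration of AnyRes-style tiling, not the precise algorithm):

```python
import math

BASE = 336  # LLaVA v1.5 resized everything to a single 336x336 tile

def tiles(width: int, height: int, base: int = BASE) -> int:
    """Number of base-resolution tiles needed to cover the image."""
    return math.ceil(width / base) * math.ceil(height / base)

# v1.5: always 1 tile, so 336*336 = 112,896 pixels seen, regardless of input size.
# v1.6 (AnyRes) on a 672x672 input: 4 high-res tiles plus a downscaled overview.
high_res = tiles(672, 672)
print(f"{high_res} tiles -> {high_res}x the pixels of v1.5")
```

That 4x pixel budget is why small text in the corner of a large diagram survives where v1.5 would blur it away.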

You can simply ask it to “Think step by step” in the prompt; the 34B model is smart enough to use chain-of-thought reasoning.
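For example, the chat template from the 34B server above can be extended with a step-by-step instruction (a sketch; the special tokens match the Yi-34B template used in server.py):

```python
def build_prompt(question: str, cot: bool = False) -> str:
    """Build the Yi-34B chat prompt used by llava-v1.6-34b-hf."""
    if cot:
        question = f"{question} Think step by step."
    return (
        "<|im_start|>system\nAnswer the questions.<|im_end|>"
        f"<|im_start|>user\n<image>\n{question}<|im_end|>"
        "<|im_start|>assistant\n"
    )

print(build_prompt("Why is this chart misleading?", cot=True))
```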


These models are pretty old now - there are better and faster models available.

The SOTA vision models now are:

  • Qwen3-VL series
  • GLM-4.6V
  • Deepseek-OCR - this one is suitable for OCR/document parsing only

And no need to create a custom Python endpoint - vLLM works just fine with all of them (well, Deepseek-OCR is a bit special here). Or llama.cpp for Qwen3-VL series.

Yes, I would give it a try with nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD, it’s quite decent. Even though it looks like NVFP4 support is still lacking, it’s worth trying out.

https://forums.developer.nvidia.com/t/running-nvidia-nemotron-nano-vl-12b-v2-nvfp4-qad-on-your-spark/350349/5

You’re right. I’ve been toying with Qwen3-VL but haven’t had the time to create an image/container for the Spark yet. That’s my next mini project.

You don’t need to create a separate image or a Python wrapper for each model. Just use a vLLM or SGLang container (either from NVIDIA or one of ours here in the forum) and load models there.

Also, there’s no need to use BF16 weights - use one of the quantized versions. FP8/AWQ 8-bit is pretty much identical in quality, but runs about 2x faster and is half the size. In most cases, FP4/AWQ 4-bit loses very little quality while taking 4x less space and running up to 4x faster than the original 16-bit weights.
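The size claims are simple arithmetic. A sketch for a roughly 34B-parameter model (weights only; runtime memory adds KV cache and activations, and quantized checkpoints keep a few layers in higher precision):

```python
PARAMS = 34e9  # approximate parameter count for a 34B model

def weights_gb(bits_per_param: float) -> float:
    """Approximate on-disk/in-memory size of the weights alone."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16/FP16", 16), ("FP8 / 8-bit AWQ", 8), ("FP4 / 4-bit AWQ", 4)]:
    print(f"{name:>16}: ~{weights_gb(bits):.0f} GB")
```

Halving the bits halves the weight footprint, and since large-batch-1 inference is memory-bandwidth-bound, moving fewer bytes per token is also where most of the speedup comes from.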