LLaVA-Mistral multimodal (7B & 34B)

I’ve been working with Nemotron Nano VL 12B on the DGX Spark and thought I’d try out some alternative options. Here’s a quick comparison. (Edit: see details on the 34B float16 in the reply.)

| Feature | LLaVA-NeXT-Mistral | Nemotron Nano VL 12B |
| --- | --- | --- |
| Active params | ~13B | 12B |
| Total params | ~47B | 12B |
| Intelligence | Deep reasoning. Better at “thinking” through complex images, analyzing charts, and broad world knowledge. | Task-optimized. Specialized for OCR, reading documents, and edge use cases. |
| Latency | Medium. It has to route tokens between experts. | Ultra-low. It is a straight shot through a smaller network. |
| Architecture | Mixture of Experts (Mixtral 8x7B) | Dense model |
| On DGX Spark | Perfect fit. Uses the massive 128 GB of memory to store all the experts, but runs fast because only the active experts are used per token. | Overkill. The DGX Spark could run 5–6 instances of this model simultaneously. |
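As a rough sanity check on the “perfect fit” claim, here is a back-of-the-envelope memory estimate (a sketch only: the ~47B total / ~13B active figures correspond to the Mixtral-8x7B-based variant, and real usage adds KV cache and activation overhead on top of the weights):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# MoE: all ~47B weights must sit in memory, even though only ~13B are active per token.
moe_fp16 = weight_memory_gb(47, 2)   # float16 = 2 bytes/param
nano_fp16 = weight_memory_gb(12, 2)  # Nemotron Nano VL 12B

print(f"MoE fp16 weights:  ~{moe_fp16:.0f} GB (fits in the Spark's 128 GB unified memory)")
print(f"Nano fp16 weights: ~{nano_fp16:.0f} GB ({128 // nano_fp16:.0f} instances fit by weights alone)")
```

The point the table is making: an MoE buys the latency of a ~13B forward pass while paying the memory cost of ~47B weights, which is exactly the trade-off the Spark’s large unified memory absorbs.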

Here is the Dockerfile

FROM nvcr.io/nvidia/pytorch:25.10-py3

ENV DEBIAN_FRONTEND=noninteractive

# 1. System Deps
RUN apt-get update && apt-get install -y \
    libgl1 \
    libglib2.0-0 \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# 2. Python Setup
RUN pip install --upgrade pip
RUN pip uninstall -y torchao || true

# 3. Install Standard Python Deps
RUN pip install \
    transformers==4.46.3 \
    "accelerate>=1.0.0" \
    bitsandbytes \
    pillow \
    requests \
    protobuf \
    scipy \
    fastapi \
    uvicorn \
    python-multipart

# 4. OPTIMIZED FLASH ATTENTION BUILD
# MAX_JOBS=4 prevents memory crash.
# TORCH_CUDA_ARCH_LIST="10.0" tells it to ONLY build for Blackwell (GB10).
# This skips building for Volta (V100), Ampere (A100), and Hopper (H100).
ENV MAX_JOBS=4
ENV TORCH_CUDA_ARCH_LIST="10.0"
RUN pip install flash-attn --no-build-isolation --upgrade --no-cache-dir

WORKDIR /app

Build it

docker build -t llava-next-mistral:v1 .

Create a server.py file

import uvicorn
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import io
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

app = FastAPI()

# --- Load Model ---
print("Initializing LLaVA-NeXT-Mistral on DGX GB10...")
# Hardcode device to cuda since Flash Attention requires it
device = "cuda"
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load Processor
processor = LlavaNextProcessor.from_pretrained(model_id)

# Load Model
# FIX: We use 'attn_implementation' and 'device_map' to load directly to GPU
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

print(f"Model loaded successfully on {device}!")

@app.post("/generate")
async def generate(image: UploadFile = File(...), prompt: str = Form(...)):
    try:
        # Read image
        image_content = await image.read()
        pil_image = Image.open(io.BytesIO(image_content))

        # Format Prompt
        full_prompt = f"[INST] <image>\n{prompt} [/INST]"

        # Inference
        # Note: We don't need .to(device) here for the model, it's already there.
        # But inputs must be moved.
        inputs = processor(images=pil_image, text=full_prompt, return_tensors="pt").to(device)
        
        output = model.generate(
            **inputs, 
            max_new_tokens=200,
            do_sample=True,
            temperature=0.2
        )

        # Decode
        generated_text = processor.decode(output[0], skip_special_tokens=True)
        
        # Clean response
        response_only = generated_text.split("[/INST]")[-1].strip()

        return {"response": response_only}

    except Exception as e:
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the docker container

docker run -it --rm \
    --gpus all \
    --ipc=host \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -v $(pwd)/scripts:/app/scripts \
    -v $(pwd)/models:/root/.cache/huggingface \
    llava-next-mistral:v1 \
    python3 /app/scripts/server.py

Test it

# Download a dummy image to test if you don't have one
wget https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png -O test_image.png

# Send the request
curl -X POST "http://localhost:8000/generate" \
     -H "accept: application/json" \
     -H "Content-Type: multipart/form-data" \
     -F "prompt=Describe this image." \
     -F "image=@test_image.png"
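If you’d rather call the endpoint from Python than curl, here is a minimal stdlib-only client sketch (no `requests` dependency; the `prompt` and `image` field names match the FastAPI handler above):

```python
import io
import uuid

def build_multipart(prompt: str, image_bytes: bytes, filename: str = "test_image.png"):
    """Build a multipart/form-data body for the /generate endpoint."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def part(headers: str, payload: bytes):
        body.write(f"--{boundary}\r\n{headers}\r\n\r\n".encode())
        body.write(payload)
        body.write(b"\r\n")

    part('Content-Disposition: form-data; name="prompt"', prompt.encode())
    part(
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
        "Content-Type: image/png",
        image_bytes,
    )
    body.write(f"--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

# Usage against a running server (uncomment with server.py up on localhost:8000):
# import urllib.request
# data, ctype = build_multipart("Describe this image.", open("test_image.png", "rb").read())
# req = urllib.request.Request("http://localhost:8000/generate", data=data,
#                              headers={"Content-Type": ctype})
# print(urllib.request.urlopen(req).read().decode())
```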

Use LLaVA-NeXT-Mistral if:

  • You need the model to explain why something is happening in an image.

  • You are analyzing complex scenes with many objects.

  • You are doing “General Purpose” chat (e.g., asking it about history, coding, or art based on an image).

I’ll post the config for the llava-v1.6-34b-hf model next, and then also the MoE version based on Mixtral 8x7B.

I’ve created a server for running the 34B model in float16 with streaming. You’ll need to install this first:

pip install sentencepiece

And then the server.py for the larger model:

import uvicorn
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import io
import threading
from transformers import (
    LlavaNextProcessor, 
    LlavaNextForConditionalGeneration, 
    TextIteratorStreamer
)

app = FastAPI()

# --- CONFIGURATION ---
MODEL_ID = "llava-hf/llava-v1.6-34b-hf"

print(f"Initializing {MODEL_ID} in Float16 (High Accuracy)...")

# 1. Load Processor
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)

# 2. Load Model
# FIX: device_map="cuda" forces initialization directly on the GPU.
# This satisfies Flash Attention 2 requirements.
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="cuda"  
)

print("--- Model Loaded & Ready ---")

@app.post("/generate")
async def generate(image: UploadFile = File(...), prompt: str = Form(...)):
    try:
        # Process Image
        content = await image.read()
        pil_image = Image.open(io.BytesIO(content))

        # Yi-34B Prompt Template
        full_prompt = f"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\n{prompt}<|im_end|><|im_start|>assistant\n"

        # Explicit inputs moved to CUDA
        inputs = processor(text=full_prompt, images=pil_image, return_tensors="pt").to("cuda")

        # Setup Streamer
        streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
        
        # Generation Arguments
        generation_kwargs = dict(
            inputs, 
            streamer=streamer, 
            max_new_tokens=512, 
            do_sample=True, 
            temperature=0.2, 
            top_p=0.9
        )

        # Run generation in a background thread
        thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        # Collect the stream
        def iter_response():
            accumulated_text = ""
            print("\nGenerating: ", end="") # Visual marker in server logs
            for new_text in streamer:
                # Print to server console for immediate feedback
                print(new_text, end="", flush=True) 
                accumulated_text += new_text
            return accumulated_text

        # Wait for completion and return full string
        final_response = iter_response()
        print("\n--- Done ---")
        
        # Clean up the output
        clean_response = final_response.split("assistant\n")[-1].strip()
        
        return {"response": clean_response}

    except Exception as e:
        print(f"Error: {e}")
        return JSONResponse(content={"error": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Why I like/prefer the 34B model:

Unlike the 7B or 13B models, Yi-34B has a much deeper “world understanding.”

Complex Instruction Following: If you ask it to “Extract all the text in the red box and format it as JSON,” it will likely succeed where the 7B model would hallucinate or just give raw text.

OCR (Optical Character Recognition): It is excellent at reading dense text, charts, and documents compared to smaller models.

Nuance: It can explain why something is funny or unusual in an image, rather than just listing objects.

LLaVA-NeXT uses a technique called AnyRes, which allows it to see high-resolution images.

Old LLaVA (v1.5): Resized everything to a blurry 336x336 square.

This Model (v1.6): Dynamically slices the image into patches. It can see up to 4x more pixels than v1.5. This means it can spot small text or details in the corner of a large diagram.
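The pixel-budget difference is easy to quantify. A rough sketch (the exact grid shapes actually come from the model’s `image_grid_pinpoints` config, so treat the 672x672 case as an illustration of AnyRes-style tiling, not the precise algorithm):

```python
import math

BASE = 336  # LLaVA v1.5 resized everything to a single 336x336 tile

def tiles(width: int, height: int, base: int = BASE) -> int:
    """Number of base-resolution tiles needed to cover the image."""
    return math.ceil(width / base) * math.ceil(height / base)

# v1.5: always 1 tile, so 336*336 = 112,896 pixels seen, regardless of input size.
# v1.6 (AnyRes) on a 672x672 input: 4 high-res tiles plus a downscaled overview.
high_res = tiles(672, 672)
print(f"{high_res} tiles -> {high_res}x the pixels of v1.5")
```

That 4x pixel budget is why small text in the corner of a large diagram survives where v1.5 would blur it away.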

You can simply ask it to “Think step by step” in the prompt; the 34B model is smart enough to use chain-of-thought reasoning.
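For example, the chat template from the 34B server above can be extended with a step-by-step instruction (a sketch; the special tokens match the Yi-34B template used in server.py):

```python
def build_prompt(question: str, cot: bool = False) -> str:
    """Build the Yi-34B chat prompt used by llava-v1.6-34b-hf."""
    if cot:
        question = f"{question} Think step by step."
    return (
        "<|im_start|>system\nAnswer the questions.<|im_end|>"
        f"<|im_start|>user\n<image>\n{question}<|im_end|>"
        "<|im_start|>assistant\n"
    )

print(build_prompt("Why is this chart misleading?", cot=True))
```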


These models are pretty old now - there are better and faster models available.

The SOTA vision models now are:

  • Qwen3-VL series
  • GLM-4.6V
  • Deepseek-OCR - this one is suitable for OCR/document parsing only

And no need to create a custom Python endpoint - vLLM works just fine with all of them (well, Deepseek-OCR is a bit special here). Or llama.cpp for Qwen3-VL series.

Yes, I would give it a try with nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD, it’s quite decent. Even though it looks like NVFP4 support is still lacking, it’s worth trying out.

https://forums.developer.nvidia.com/t/running-nvidia-nemotron-nano-vl-12b-v2-nvfp4-qad-on-your-spark/350349/5

You’re right. I’ve been toying with Qwen3-VL but haven’t had the time to create an image/container for the Spark yet. That’s my next mini project.

You don’t need to create a separate image or a Python wrapper for each model. Just use a vLLM or SGLang container (either from NVIDIA or one of ours here in the forum) and load models there.

Also, there’s no need to use BF16 weights - use one of the quantized versions. FP8/AWQ 8-bit is pretty much identical in quality, but runs about 2x faster and is half the size. In most cases, FP4/AWQ 4-bit loses very little quality while taking 4x less space and running up to 4x faster than the original 16-bit weights.
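The size claims are simple arithmetic. A sketch for a roughly 34B-parameter model (weights only; runtime memory adds KV cache and activations, and quantized checkpoints keep a few layers in higher precision):

```python
PARAMS = 34e9  # approximate parameter count for a 34B model

def weights_gb(bits_per_param: float) -> float:
    """Approximate on-disk/in-memory size of the weights alone."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16/FP16", 16), ("FP8 / 8-bit AWQ", 8), ("FP4 / 4-bit AWQ", 4)]:
    print(f"{name:>16}: ~{weights_gb(bits):.0f} GB")
```

Halving the bits halves the weight footprint, and since large-batch-1 inference is memory-bandwidth-bound, moving fewer bytes per token is also where most of the speedup comes from.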