Hi everyone,
I built this Docker image to support a unified workflow: running vLLM inference alongside Unsloth training in the same environment. I'm sharing it here in case it helps others who are looking for a pre-configured environment for similar tasks.
This image is tuned specifically for the NVIDIA Blackwell (GB10) architecture. It addresses the kernel compilation issues and compatibility hurdles that commonly arise when setting up these tools on the newest hardware.
🏠 Repository
- Docker Hub: gogamza/unsloth-vllm-gb10:latest
🚀 Key Features and Blackwell (SM 10.0) Optimizations
- Dual-Purpose Design: Seamlessly switch between or run high-speed Unsloth fine-tuning and high-throughput vLLM inference.
- Custom-Patched vLLM: Built from source with specific patches to resolve Blackwell-related kernel errors, ensuring a stable environment for SM 10.0 nodes.
- vLLM V1 Engine & FlashInfer: Pre-configured to use the latest vLLM V1 engine (VLLM_USE_V1=1) and FlashInfer attention backend for maximum performance.
- FP4 Precision Support: Ready for high-efficiency MoE (Mixture of Experts) models with VLLM_USE_FLASHINFER_MOE_FP4 support.
- Unsloth Integration: Includes the full Unsloth stack (unsloth, unsloth_zoo, qwen-vl-utils) for faster training and efficient VRAM usage.
- Offline Support: Pre-cached tiktoken encodings (o200k, cl100k) are included for reliability in enterprise or air-gapped clusters.
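If you want to confirm which of these flags the image actually exports, a small check like the one below may help. It is only a sketch: the two variable names come from the list above, while the helper name show_flag is made up for illustration. It is meant to be run in a shell inside the container.

```shell
#!/bin/sh
# show_flag is a hypothetical helper: print NAME=value for an exported
# variable, or say that it is not set.
show_flag() {
    if value=$(printenv "$1"); then
        printf '%s=%s\n' "$1" "$value"
    else
        printf '%s is not set\n' "$1"
    fi
}

# The two flags named in the feature list above.
for name in VLLM_USE_V1 VLLM_USE_FLASHINFER_MOE_FP4; do
    show_flag "$name"
done
```

A variable reported as not set only means the image does not export it under that name; it says nothing about vLLM's own defaults.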
🛠 Tech Stack
- Base Image: nvcr.io/nvidia/pytorch:25.09-py3
- CUDA: 13.0
- Environment Settings: Pre-tuned NCCL IB settings and CUDA Graph mode (full_and_piecewise) specifically for Blackwell.
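To see what the stack looks like on your own node, something along these lines could be used. This is a sketch under assumptions: it assumes the image passes a command through to the container (as the NVIDIA base image does), that your driver's nvidia-smi supports the compute_cap query field, and that python is on the PATH. The check_stack function and the DOCKER override are hypothetical conveniences, not part of the image.

```shell
#!/bin/sh
IMAGE="gogamza/unsloth-vllm-gb10:latest"

# check_stack (hypothetical helper) prints, from inside the container:
#   1. GPU name and compute capability as reported by the driver
#   2. the CUDA version PyTorch was built against
#   3. any NCCL_* variables the image exports
# Set DOCKER=echo for a dry run that only prints the command.
check_stack() {
    ${DOCKER:-docker} run --gpus all --rm "$IMAGE" sh -c '
        nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
        python -c "import torch; print(torch.version.cuda)"
        env | grep "^NCCL_" || echo "no NCCL_* variables exported"
    '
}
```

Calling check_stack on a GPU host runs the three checks in a throwaway container. The reported compute capability can then be compared with the SM version mentioned above.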
💻 Quick Start
docker run --gpus all -it --rm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
gogamza/unsloth-vllm-gb10:latest
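The command above drops you into an interactive shell. If you would rather start an OpenAI-compatible vLLM server directly, a variant like the following might work. Treat it as a sketch: the model name, the port, and the helper functions serve_model and list_models are assumptions for illustration, and it assumes the image accepts vllm serve as its command.

```shell
#!/bin/sh
IMAGE="gogamza/unsloth-vllm-gb10:latest"
# Assumption: replace with any model your vLLM build supports.
MODEL="${MODEL:-Qwen/Qwen2.5-0.5B-Instruct}"
PORT="${PORT:-8000}"

# serve_model (hypothetical helper) starts the server in the foreground.
# --ipc=host gives the container the host's shared memory, which PyTorch
# uses for inter-process communication. Set DOCKER=echo for a dry run.
serve_model() {
    ${DOCKER:-docker} run --gpus all --rm --ipc=host \
        -p "$PORT:8000" \
        -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
        "$IMAGE" \
        vllm serve "$MODEL" --host 0.0.0.0 --port 8000
}

# list_models (hypothetical helper) queries the running server from the host.
list_models() {
    curl -s "http://localhost:$PORT/v1/models"
}
```

Run serve_model in one terminal and, once the server reports it is ready, list_models in another to confirm the model is loaded.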
I’ve been using this setup for my own VLM/LLM research on Blackwell nodes, and it has significantly simplified my environment management. I would love to hear if this works for your specific use cases or if you have any suggestions for further Blackwell optimizations!
Best regards,