Day 1 with DGX Spark (Asus version)

I was hitting this while getting my setup solid.

You should run `ray status` before starting vLLM and see two nodes. In my case, a networking configuration error kept them from talking to each other. In eugr's configuration you pass in both the Ethernet interface to use (your 10G Ethernet) and the ConnectX-7 (the fast one); both must be up and configured correctly.
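As a quick preflight, something like this can gate the launch. This is only a sketch: the ` 1 node_<id>` line format matches recent Ray versions, and sample output is inlined here so the snippet is self-contained; on a live cluster you would capture `ray status` instead.

```shell
# Preflight sketch: refuse to launch vLLM unless `ray status` shows both Sparks.
# Sample output inlined for illustration; on a live cluster use:
#   status_output=$(ray status)
status_output='======== Autoscaler status ========
Active:
 1 node_0a1b2c3d
 1 node_4e5f6a7b
Pending:
 (no pending nodes)'

# Each active node appears as a " 1 node_<id>" line.
node_count=$(printf '%s\n' "$status_output" | grep -c '^ 1 node_')
if [ "$node_count" -eq 2 ]; then
    echo "OK: 2 Ray nodes up, safe to start vllm serve"
else
    echo "ERROR: only $node_count Ray node(s) up - check ETH_IF/IB_IF config" >&2
fi
```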

I've actually found a source of this bug - see "PSA: Cluster inference hangs with one node at 100% and another on 0%" (issue #24 in eugr/spark-vllm-docker on GitHub).

You probably pulled before the fix was pushed. You can either rebuild using wheels, or pull the latest changes and run with --rebuild-deps to make sure it uses the correct PyTorch version. For gpt-oss, though, the mxfp4 build is still the best option.


Good afternoon @eugr: 2 nodes, GPT-OSS-120B, and the mxfp4 build works perfectly! Thank you so much for helping me (and everyone else). This was my first successful multi-node run, and it validates the rest of my setup.

For others…

Build:

./build-and-copy.sh \
    --exp-mxfp4 \
    --rebuild-deps \
    --tag harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130

Launch:

./launch-cluster.sh \
    --name eugr-vllm-cluster \
    -t harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130 \
    exec \
    vllm serve \
    /models/gpt-oss-120b \
    --port=8000 \
    --host=0.0.0.0 \
    --gpu-memory-utilization=0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --load-format fastsafetensors

Full log:

Auto-detecting interfaces...
  Detected IB_IF: rocep1s0f0,roceP2p1s0f0
  Detected ETH_IF: enp1s0f0np0
  Detected Local IP: 192.168.100.10 (192.168.100.10/31)
Auto-detecting nodes...
  Scanning for SSH peers on 192.168.100.10/31...
  Found peer: 192.168.100.11
  Cluster Nodes: 192.168.100.10,192.168.100.11
Head Node: 192.168.100.10
Worker Nodes: 192.168.100.11
Container Name: eugr-vllm-cluster
Image Name: harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130
Action: exec
Checking SSH connectivity to worker nodes...
  SSH to 192.168.100.11: OK
Starting Head Node on 192.168.100.10...
144664591dc536b1569ac5d63a6446a53af5a78a9732c0ebb97da8c30a853470
Starting Worker Node on 192.168.100.11...
cac6cb26f11d2a43a486891f8215204b06a1ccf11078933fcf31dc690e14ad8b
Waiting for cluster to be ready...
Cluster head is responsive.
Executing command on head node: vllm serve /models/gpt-oss-120b --port=8000 --host=0.0.0.0 --gpu-memory-utilization=0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
[2026-01-30 21:57:50] INFO font_manager.py:1639: generated new fontManager
(APIServer pid=1161) INFO 01-30 21:57:50 [api_server.py:1278] vLLM API server version 0.1.dev12774+g459541683.d20260130
(APIServer pid=1161) INFO 01-30 21:57:50 [utils.py:253] non-default args: {'model_tag': '/models/gpt-oss-120b', 'host': '0.0.0.0', 'model': '/models/gpt-oss-120b', 'load_format': 'fastsafetensors', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7}
(APIServer pid=1161) INFO 01-30 21:57:54 [model.py:538] Resolved architecture: GptOssForCausalLM
(APIServer pid=1161) INFO 01-30 21:57:54 [model.py:1531] Using max model len 131072
(APIServer pid=1161) WARNING 01-30 21:57:54 [vllm.py:1439] Current vLLM config is not set.
(APIServer pid=1161) INFO 01-30 21:57:54 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1161) INFO 01-30 21:57:54 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=1161) INFO 01-30 21:57:54 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=1161) INFO 01-30 21:57:55 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1161) INFO 01-30 21:57:55 [config.py:307] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=1161) WARNING 01-30 21:57:55 [vllm.py:621] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=1161) INFO 01-30 21:57:55 [vllm.py:640] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:00 [core.py:96] Initializing a V1 LLM engine (v0.1.dev12774+g459541683.d20260130) with config: model='/models/gpt-oss-120b', speculative_config=None, tokenizer='/models/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=/models/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:00 [ray_utils.py:334] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
(EngineCore_DP0 pid=1336) 2026-01-30 21:58:00,241       INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 192.168.100.10:6379...
(EngineCore_DP0 pid=1336) 2026-01-30 21:58:00,248       INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at http://192.168.100.10:8265
(EngineCore_DP0 pid=1336) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=1336)   warnings.warn(
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:01 [ray_utils.py:399] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:01 [ray_utils.py:210] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node afc91b073d4242a859cd217499b587964ac4400eb5dfeb77be84eb1c. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:01 [ray_utils.py:210] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node da111a2f2abe0db01e307210a7ddae789ed3b94511057fd6a4bac3ac. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:66] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:69] Copying the following environment variables to workers: ['CUDA_HOME', 'LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:74] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:07 [system_utils.py:36] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:08 [worker_base.py:296] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:09 [parallel_state.py:1214] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.100.10:36153 backend=nccl
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:09 [pynccl.py:111] vLLM is using nccl==2.28.9
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:10 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:10 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:10 [parallel_state.py:1425] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:11 [gpu_model_runner.py:3804] Starting to load model /models/gpt-oss-120b...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [flashinfer.py:334] SM12x detected - using native FlashInfer CUTLASS attention instead of TRT-LLM attention (cubins not available for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [mxfp4.py:223] [MXFP4] Auto-selected: cutlass (FlashInfer CUTLASS FP8×FP4 for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:07 [system_utils.py:36] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:09 [worker_base.py:296] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:09 [parallel_state.py:1214] world_size=2 rank=1 local_rank=0 distributed_init_method=tcp://192.168.100.10:36153 backend=nccl
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:10 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:10 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:10 [parallel_state.py:1425] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:03<00:21,  3.03s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:05<00:14,  2.46s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:07<00:11,  2.27s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:08<00:08,  2.05s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:10<00:06,  2.03s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:12<00:04,  2.03s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:38 [default_loader.py:291] Loading weights took 14.52 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [flashinfer.py:334] SM12x detected - using native FlashInfer CUTLASS attention instead of TRT-LLM attention (cubins not available for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [mxfp4.py:223] [MXFP4] Auto-selected: cutlass (FlashInfer CUTLASS FP8×FP4 for SM12x)
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:14<00:01,  1.93s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:16<00:00,  1.84s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:16<00:00,  2.03s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:41 [gpu_model_runner.py:3901] Model loading took 33.0565 GiB memory and 30.153112 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:46 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/7899cc7b9c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:46 [backends.py:704] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:40 [default_loader.py:291] Loading weights took 16.25 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) [rank0]:W0130 21:58:47.428000 1459 torch/_inductor/utils.py:1725] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:50 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:41 [gpu_model_runner.py:3901] Model loading took 33.0565 GiB memory and 30.179833 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:10 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 83.25 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:10 [monitor.py:34] torch.compile takes 87.53 s in total
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:46 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/7899cc7b9c/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:46 [backends.py:704] Dynamo bytecode transform time: 4.28 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:50 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:12 [gpu_worker.py:356] Available KV cache memory: 45.750000 GiB
(EngineCore_DP0 pid=1336) INFO 01-30 22:00:12 [kv_cache_utils.py:1305] GPU KV cache size: 1,332,688 tokens
(EngineCore_DP0 pid=1336) INFO 01-30 22:00:12 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 20.00x
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:14 [utils.py:475] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) 2026-01-30 22:00:20,146 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) 2026-01-30 22:00:20,166 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) [rank1]:W0130 21:58:47.479000 383 torch/_inductor/utils.py:1725] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:20 [kernel_warmup.py:64] Warming up FlashInfer attention.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:10 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 83.28 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:10 [monitor.py:34] torch.compile takes 87.49 s in total
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:12 [gpu_worker.py:356] Available KV cache memory: 45.950000 GiB
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:20 [utils.py:475] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/83 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) 2026-01-30 22:00:20,147 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) 2026-01-30 22:00:20,160 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends

(REMOVED MODEL LOADING ENTRIES...)

(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:01:08 [gpu_model_runner.py:4856] Graph capturing finished in 12 secs, took 4.62 GiB
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:20 [kernel_warmup.py:64] Warming up FlashInfer attention.
(EngineCore_DP0 pid=1336) INFO 01-30 22:01:10 [core.py:273] init engine (profile, create kv cache, warmup model) took 148.67 seconds
(EngineCore_DP0 pid=1336) INFO 01-30 22:01:11 [vllm.py:640] Asynchronous scheduling is disabled.
(APIServer pid=1161) INFO 01-30 22:01:11 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=1161) WARNING 01-30 22:01:11 [serving_responses.py:222] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1161) INFO 01-30 22:01:12 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=1161) INFO 01-30 22:01:12 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1161) INFO 01-30 22:01:12 [serving_chat.py:216] Chat template warmup completed in 509.7ms
(APIServer pid=1161) INFO 01-30 22:01:12 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:38] Available routes are:
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1161) INFO:     Started server process [1161]
(APIServer pid=1161) INFO:     Waiting for application startup.
(APIServer pid=1161) INFO:     Application startup complete.
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:535] RAY_CGRAPH_get_timeout is set to 300
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:539] VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE = auto
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:543] VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM = False
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:602] Using RayPPCommunicator (which wraps vLLM _PP GroupCoordinator) for Ray Compiled Graph communication.
(APIServer pid=1161) INFO 01-30 22:02:32 [loggers.py:257] Engine 000: Avg prompt throughput: 15.8 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO 01-30 22:02:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO:     127.0.0.1:53698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1161) INFO 01-30 22:02:52 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO 01-30 22:03:02 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
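Once the server logs `Application startup complete`, a quick smoke test against the OpenAI-compatible endpoint confirms end-to-end inference. The curl line below is the actual request; a canned response stands in for a live server so the parsing step is self-contained.

```shell
# Smoke-test sketch for the OpenAI-compatible endpoint. On a live cluster:
#   curl -s http://192.168.100.10:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "/models/gpt-oss-120b", "messages": [{"role": "user", "content": "ping"}]}'
# Canned response used here so the check runs anywhere:
response='{"id":"chatcmpl-1","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"pong"}}]}'

# A healthy reply identifies itself as a chat completion object.
case "$response" in
    *'"object":"chat.completion"'*) result="server OK" ;;
    *) result="unexpected response" ;;
esac
echo "$result"
```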

As I mentioned earlier, I thought the issue might have been my RoCE configuration, so I'm sharing /etc/netplan/40-cx7.yaml here in case anyone finds it useful:

Master

network:
  version: 2
  renderer: networkd

  ethernets:
    # RoCE Link A (fabric A)
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      addresses:
        - 192.168.100.10/31
      optional: true

    # Unused / unplugged port
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      optional: true

    # RoCE Link B (fabric B)
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      addresses:
        - 192.168.101.0/31
      optional: true

    # Unused / unplugged port
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      optional: true

Worker

network:
  version: 2
  renderer: networkd

  ethernets:
    # RoCE Link A (fabric A)
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      addresses:
        - 192.168.100.11/31
      optional: true

    # Unused / unplugged port
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      optional: true

    # RoCE Link B (fabric B)
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: []
      addresses:
        - 192.168.101.1/31
      optional: true

    # Unused / unplugged port
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: []
      optional: true
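The /31s above are RFC 3021 point-to-point pairs, so each interface's peer is always the other address of the pair (last octet XOR 1). A tiny helper like this (a sketch, not part of the launch scripts) is handy for connectivity-check loops:

```shell
# Sketch: derive the peer address on an RFC 3021 /31 point-to-point link.
# The peer is the pair's other address, i.e. last octet XOR 1.
peer_of() {
    last=${1##*.}      # last octet
    prefix=${1%.*}     # first three octets
    echo "$prefix.$((last ^ 1))"
}

peer_of 192.168.100.10    # master fabric A -> 192.168.100.11
peer_of 192.168.101.1     # worker fabric B -> 192.168.101.0
```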

Week 1 with DGX Spark:

  1. Hail eugr “the God”

  2. GPT-OSS is SOTA for non-English users. Other models tend to struggle with languages other than English; in contrast, GPT-OSS produces remarkably competent results in both comprehension and generation, even at 20B.
    I noticed this even when running small models like Qwen3-8B, though back then I was skeptical whether it would hold up beyond 20B. After getting the Spark and running mid-to-large models (>30B), it became much clearer. It's not parameter-dependent: GPT-OSS is just that good.

  3. Ultimately, my takeaway is that the DGX Spark's advantage is that it can run large models (>100B), albeit slowly. For this, MX/NVFP4-quantized models are essential; the utility of this device will grow as more MX/NVFP4 QAT models are produced.
    One limitation is that we cannot quantize large models distributed in BF16 or FP8 to NVFP4 on a single Spark. While nobody is obliged to do so, if NVIDIA quantized and shared more large models in NVFP4 or MXFP4, the Spark's advantages would appeal to general users as well.

After testing various models and configurations, we are currently serving GPT-OSS 20B to approximately six simultaneous users, and the response has been positive so far (a chat service via WebUI, not for vibe coding). With the arrival of ConnectX-7 cables next week, we expect to expand the service to more users on GPT-OSS 120B. Thank you all!


@nvidia.vitality213 where do you define the MTU for the CX7 links since it’s not in the 40-cx7.yaml and the netplan renderer is systemd-networkd? Setting jumbo frames on the CX7 links helps!


BTW, while this is permissible with this netmask, I'd avoid using .0 as a host address, as it may not play well with some software. And yes, setting the MTU to 9000 will help (set it on both "halves" of the interface, even if you don't use one):

network:
  version: 2
  ethernets:
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no              # Explicitly disable DHCPv6
      link-local: [ ipv4 ]   # Restrict link-local addresses to IPv4 only
      mtu: 9000
      addresses: [192.168.177.11/24]
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000
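To confirm jumbo frames actually pass end-to-end after `netplan apply` (not just that the MTU is set), a don't-fragment ping probe works. The arithmetic below is the only part that runs anywhere; the ping lines assume the peer addresses from the configs in this thread.

```shell
# Sketch: the largest ICMP payload that fits an MTU is
#   MTU - 20 (IPv4 header) - 8 (ICMP header)
MTU=9000
payload=$((MTU - 28))
echo "probe payload: $payload bytes"
# On a live link (peer address from the configs above):
#   ping -M do -s "$payload" -c 3 192.168.100.11            # should succeed
#   ping -M do -s "$((payload + 1))" -c 3 192.168.100.11    # should fail: message too long
```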

I think gpt-oss-120b should be able to handle six users even on a single Spark, especially if you use the new MXFP4 optimized option.

Dear blessed God,

Now I'm serving the cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model (from Hugging Face) with the vLLM build from nvcr.io/nvidia/vllm:25.12.post1-py3.

Speculative decoding is enabled with num_speculative_tokens = 2 and gpu_memory_utilization = 0.7.

With this configuration, time to first token is very fast.
I think these values are the sweet spot for a single Spark; increasing num_speculative_tokens or gpu_memory_utilization generally makes TTFT and generation throughput worse.
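For reference, here is a hedged sketch of how those settings map onto serve flags. The --speculative-config JSON shape is an assumption based on recent vLLM documentation; verify it against `vllm serve --help` inside the 25.12 container before relying on it.

```shell
# Sketch only: compose the serve command for the settings described above.
# The --speculative-config keys are an assumption from recent vLLM docs.
MODEL="cyankiwi/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
SPEC='{"num_speculative_tokens": 2}'
cmd="vllm serve $MODEL --gpu-memory-utilization 0.7 --speculative-config '$SPEC'"
echo "$cmd"
```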

Now my teammates are testing whether it is more stable than GPT-OSS 20B/120B.

Thank you for your help.


Hi, sorry for the late reply.

I couldn't answer your questions at the time, since my Asus Spark works fine.
I have found some issues, like sudden Firefox crashes and system freezes when loading large models.

However, to date, I haven’t observed any instances of system instability due to thermal issues.

My teammate has the foundation version (the original DGX Spark), and he says the Asus version is much better in terms of thermals.

I'll post an update if any thermal issues occur with my Spark.
Thank you.


Thanks for the reply!

Yeah, after a week or so of tinkering I'm no longer sure I really saw a thermal event. That said, I used it as an excuse to learn Autodesk Fusion, so now I have a 3D-printed design and have even bought the fans for it. I'll report back if I see any material improvement in temps or performance.