Day 1 with DGX Spark (Asus version)

1) Purchase Rationale

  • Due to company policies restricting the use of external LLM services, I purchased this to build a local LLM infrastructure for workflow automation.

  • Before purchasing, I deployed GPT-OSS 20B and 120B models on idle servers using vLLM, which confirmed the value of having local LLM services.

  • I arranged a demonstration with the distributor before purchase, testing GPT-OSS 20B with Ollama, which showed satisfactory performance.

  • Finally, I purchased three units: one for serving GPT-OSS 20B/120B, and two connected via ConnectX-7 to quantize large models with NVFP4 or run them directly.

2) Expectations

※ I’m not from an AI-specialized department, so I may have had some misconceptions.

  • I expected significantly better performance with vLLM or TensorRT-LLM compared to what I saw with Ollama.

  • NVIDIA had already provided NVFP4-converted models on Hugging Face, and I expected stable operation when using them with the official TensorRT-LLM toolkit.

  • I anticipated that qwen3-next-80b-a3b-nvfp4 would provide an optimal balance of size and performance.

3) Reality Check (First Day Experience)

  • Poor Playbook Update Tracking: Features I expected to be supported in the latest versions were deprecated, and things only worked with the exact versions specified in the Playbook. (I know that’s exactly why we use Docker; I just didn’t anticipate differences even between 1.2.0rc6 and rc8.)

  • Questionable NVFP4 Effectiveness: Further validation is needed, but the first impression was underwhelming. Since I couldn’t load qwen3-next, I tried the NVFP4 version of qwen3-30b-a3b instead, and it suffered from severe hallucinations and frequently ignored the system prompt.

    • Note: As a Korean user, I’ve found that smaller models and heavily quantized models tend to respond in mixed Korean+Japanese+Chinese when given Korean input. The performance felt worse than when I tested qwen3-30b-a3b Q4 on an RTX 4090. English users likely won’t encounter this issue. I will dig deeper into this after setting up the Sparks.

  • Optimal Combination: After testing various models with TensorRT-LLM and SGLang through a WebUI, the Ollama × GPT-OSS combination showed the best performance (the exact commands I fell back to are below). GB10 currently doesn’t deliver full performance with SGLang or vLLM, or even with TensorRT-LLM; you’ll need to either wait for full SM121 support or participate in development yourself.
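
    For reference, the setup I fell back to was literally just the stock Ollama flow (model tags as I recall them from the Ollama library, so double-check before pulling):

    ollama pull gpt-oss:20b
    ollama pull gpt-oss:120b
    ollama run gpt-oss:120b    # the 120B MXFP4 weights still fit within the 128GB unified memory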

4) Conclusion

  • Engineer’s Perspective: This situation is actually interesting. The dopamine rush from wrestling with hardware and finally making it work is something every engineer understands and enjoys.

  • Customer’s Perspective: Honestly, I’m quite disappointed. I expected to unbox the device and immediately see decent performance with the TensorRT-LLM × NVFP4 combination, starting with “This is surprisingly good—perfect for small-scale local services,” and then diving into optimization with an engineer’s mindset. Instead, I encountered issues from day one, concluding with “Let’s just use Ollama for now,” which is very disappointing.

  • Community Feedback: I found similar opinions on the forums. Support for GB10 is insufficient and improving too slowly. I should have checked this before purchasing.

  • Still a Valuable Device: Despite everything, this is undeniably valuable hardware. At this price point, DGX Spark-class devices are practically the only option besides used PCs for running GPT-OSS 120B.

  • Day 1 Review Disclaimer: Some may think, “If you can’t even handle basic optimization, isn’t that your problem? Why complain?” I completely understand that perspective, and I acknowledge I still have much to learn. However, please keep in mind that this is purely a Day 1 review.

1 Like

Welcome to Spark. It’s better than it was at launch and it will get better. But “it just works” does not apply. That slogan doesn’t seem to apply to Apple anymore either, at least in some of my recent experiences. If Nvidia launches a consumer PC this year, as is rumored, we’ll see how capable they are with the great unwashed masses.

Use the forum to your advantage. You will learn a lot and you will be able to do what you need to, probably.

2 Likes

Personally, I wouldn’t use GPT OSS 20B. 120B works great on one node. I can imagine it working extremely well with concurrency with vLLM on 3 nodes.

4bit AWQ currently seems to be better than NVFP4 (Nvidia team has some serious work to do here).

It’s cutting edge + new + released last year + expensive, so there won’t be much support.

1 Like

From your experience, is the 4bit AWQ just faster than NVFP4, better quality, or both?

Here is an excellent YouTube Video that actually benchmarks the DGX Spark against other leading competitors. This is one of the first videos that shows the true power of the Spark.

3 Likes

I watched this one yesterday. There are some open questions, like whether his benchmarking tool makes sure that the prompt is not hitting the cache, for one. I mean, Spark has a better GPU than current Macs and Strix Halo, so it is definitely faster in concurrency scenarios, but his numbers seemed to be off.

Also, for his llama.cpp concurrency test, it looks like he didn’t set the max concurrency - I believe it defaults to 4, so no wonder that was the limit.
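
For anyone re-running that test: raising the slot count on llama-server is just a flag (from memory, so check llama-server --help on your build; the model path here is only a placeholder, and note that the context size is shared across the slots):

llama-server -m gpt-oss-120b.gguf -c 65536 --parallel 16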

There were some other issues with this video that I don’t remember, but I don’t want to watch it again.

2 Likes

Short answer: both.
Long answer: it depends.

Quality-wise, most NVFP4 quants in the wild are W4A4, which means that activations are also quantized to FP4. AWQ keeps activations at full precision, so the quality is better.

There are some NVFP4 W4A16 quants - with these it depends on how the quantization was done and whether a calibration dataset was used. FP4 carries more precision than INT4, so in theory, everything else being equal, such quants should have better accuracy than AWQ ones.

Now, AWQ is a mature technology, and vLLM kernels are well optimized for this. FP4 is newer, and specifically on Spark the support is lagging behind “big Blackwell”, so currently it’s slower than AWQ.
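
To make that concrete, serving an AWQ quant with vLLM needs nothing special - the model ID below is just a placeholder for whichever AWQ checkpoint you pull from Hugging Face, and vLLM picks the quantization up from the checkpoint config:

vllm serve <org>/<model>-AWQ --max-model-len 32768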

I don’t know whether that would change anytime soon, but for instance, our forum member @christopher_owen is doing great work optimizing MXFP4 performance on Spark - you can check his progress here: vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing? - #18 by christopher_owen

8 Likes

Yeah, sorry that wasn’t clear. There was another post here that implies pretty strongly that 4bit AWQ is a bit faster. As for quality, from what I understand from the papers, they should be about the same with different pros & cons, but because the LLM world moves so fast, it’s really hard to know and hard to answer these types of questions.

1 Like

Another Asus Ascent buyer here. Welcome to the team.

I got the Asus one as it was the least expensive available where I live. I discovered after purchasing it that it does not have a vapor chamber, even though its website did say it had one (at least the cached version of the page still showed this, as Google still indexes it in a typical search).

Have you noticed any thermal issues with yours? I tried stress testing mine by running a simple ComfyUI workflow non-stop, creating an image and throwing it away. I left the machine running and came back to it about 30 minutes later, only to discover it had shut down. I suspect it was a thermal event but can’t confirm it yet.
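
Next time I run that loop I’ll log temperatures alongside it so I can actually confirm or rule out a thermal event, probably something like this (standard nvidia-smi query fields; I haven’t verified yet which sensors the GB10 actually reports):

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 10 >> thermal_log.csv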

I’m thinking of building a 3D-printable cooling solution that throws forced air at the unit’s air intake. Has anyone tried something like that already?

Giant thank you for sharing your experience … I wanted to also weigh in that overall my experience after a week with my new 2x DGX Spark machines (founders edition) has been awful.

Out of the box, the initial setup bricked both devices and I needed to do a system restore on each. Thankfully that seemed to fix whatever was wrong … how did this get past QC?? As a side note, I set up one over Wi-Fi and the other with a keyboard/mouse … both failed in the exact same way.

The Nvidia-provided vLLM containers worked OK, but the model support is very limited. Generally, the stated models did run in vLLM … unfortunately, I can run them ~twice as fast on my 4-year-old RTX A6000s. To be honest, I haven’t spent time on unsupported models because, at this moment, the ecosystem has been very fragile for me.

Lurking in the forums here, I am now dreading the cooling / power issues I see some others having.

Also - why in the world isn’t there a single LED or any external activity indicator on the Spark? This is completely unacceptable for a $4k device and IMO a giant problem for folks running headless.

I really want to love these things, but holy moly, I feel like I’ve hit every major problem that I was hoping to avoid by investing in a purpose-built device from Nvidia.

At this moment, I deeply regret my purchase. I was hoping for hardware that would accelerate my research and AI learning path for work but now I’m going to sink weeks / months of my limited time into cherry picking PRs and maintaining custom builds of boilerplate software.

Can you share more about the initial setup and what you did differently after the system restore? More information/context helps.

Which model, container, and playbook were you following? Can you share the specific metrics you are seeing?

1 Like

For vLLM, just use our community vLLM build: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
It uses the latest vLLM and is optimized for dual Sparks.

In terms of performance, Spark has 273GB/s memory bandwidth, so pretty much any dedicated GPU will beat it, but only if the model fully fits into VRAM. Spark’s advantage is having 128GB of unified memory that the GPU can use. However, when properly configured, a dual Spark setup will give up to 1.9x the performance of a single one.
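
As a rough back-of-envelope for what that bandwidth means: gpt-oss-120b activates roughly 5B parameters per token, so at ~4.25 bits per weight that is on the order of 2.5-3GB of weights read per generated token, which puts the theoretical single-stream decode ceiling somewhere around 100 tokens/s at 273GB/s. Real numbers land well below that once you add KV cache traffic and kernel overhead, so treat those as ballpark figures, not measurements.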

2 Likes

Hello @NVES, thanks for the reply! The initial setup was pretty cookie-cutter. For the first one I connected to the DGX via Wi-Fi, which worked perfectly; I entered the admin account and went through the steps. Once I restarted, the DGX dropped to an error screen that told me to restart.

Apologies, I don’t remember the exact verbiage, but it didn’t have any specific messages. It was just a general system error that asked me to restart again. I did an Internet search and found a few other folks who had the same issue, and they basically said to go through the restore process, which did work for me. For my second DGX setup I hardwired to a monitor and keyboard, but the same thing happened … again, a full restore fixed the issue.

RE performance:

I don’t have any proper benchmarks at this point … as mentioned, I’ve been spending my time hacking to figure out the best-performing model I can run with my setup. Below are notes from the tests that I’ve run so far.

Single node

Multi-Node

I haven’t been able to get this setup working at all. I do have another post in this forum indicating that the dual node setup for Qwen3 didn’t work for me. I will reply back when I get something working here … again, I have limited time and decided to go with the DGX Spark because I thought it would be an easier setup with some more or less turnkey options available.

Other Attempts

  • docker.io/vllm/vllm-openai:v0.14.1-aarch64-cu130, Llama-3.3-70B-Instruct: Failed to load … consumed all system memory and I had to hard power cycle
  • scitrera/dgx-spark-vllm:0.14.1-t4, Llama-3.3-70B-Instruct-NVFP4: Failed to load … “ValueError: No valid attention backend found for cuda with AttentionSelectorConfig”

Thanks again for your reply, I’m a bit frustrated for sure but I do appreciate you (Nvidia) actively helping folks like me get up and running.

Good morning @eugr, giant thank you for the GitHub reference … I’m going to test christopherowen/spark-vllm-mxfp4-docker this AM and then I’ll try your repo. I’ll let you know how it goes … I see how active you are in this community, ty again for your contributions!

1 Like

Just keep in mind that Christopher’s version is designed to optimize for gpt-oss-120b only and may not work with other models. Also, I’ve incorporated these changes into my repo, but in a separate “mxfp4” branch so far, since it’s not working in the cluster yet. Christopher made some changes - I’m building it right now, and if I confirm that it’s fixed, I’ll push this into the main branch.
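
If you want to try the mxfp4 branch before it lands in main, it’s just a checkout of the repo linked above:

git clone https://github.com/eugr/spark-vllm-docker
cd spark-vllm-docker
git checkout mxfp4
# then build/run the same way as the main branch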

1 Like

The anticipation is killing me ☺️

BTW, I’ve merged the mxfp4 branch into main - you are welcome to try it. Please follow the instructions in the changelog.

1 Like

Good morning @eugr, I’ve been doing some testing with GPT-OSS-120B, and your build of vLLM works perfectly for single node for me. Huge thank you for this!!

I almost hate to ask, but if you have the time, I would very much appreciate your insight on the multi-node error I’m seeing. Basically, when running an inference test in a 2-node cluster, it looks like vLLM starts processing … I see GPU activity on the master spike to 100%; GPU activity on the worker spikes to 100% as well, but only for a short period. After GPU activity drops to zero on the worker, the master GPU stays at 100% for about five minutes … I see this in the vllm logs (notice generation throughput goes to 0):

(APIServer pid=1024) INFO 01-30 12:44:54 [loggers.py:257] Engine 000: Avg prompt throughput: 15.8 tokens/s, Avg generation throughput: 23.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1024) INFO 01-30 12:45:04 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1024) INFO 01-30 12:45:14 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

After about 5 min the vllm process crashes with this trace:

(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.15.0rc2.dev80+ga5aa4d5c0.d20260129) with config: model='/models/gpt-oss-120b', speculative_config=None, tokenizer='/models/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-b1786ea3f5e06d3c-b237d30b'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[466],num_output_tokens=[309]), num_scheduled_tokens={chatcmpl-b1786ea3f5e06d3c-b237d30b: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 30], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0002396225023961751, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] Traceback (most recent call last):
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     raise e
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 612, in read
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     output = self._buffers[self._next_read_index].read(timeout)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 976, in get_objects
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "python/ray/_raylet.pyx", line 2875, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "python/ray/includes/common.pxi", line 124, in ray._raylet.check_status
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 00d6d592397561444345a99b5f2ce2efa7534890010000000be1f505
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] Traceback (most recent call last):
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 939, in run_engine_core
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in run_busy_loop
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     self._process_engine_step()
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 999, in _process_engine_step
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 389, in step
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 431, in sample_tokens
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     return self._execute_dag(scheduler_output, grammar_output, non_block)
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 449, in _execute_dag
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     return refs[0].get()
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]            ^^^^^^^^^^^^^
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     self._dag._execute_until(
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=1084) ERROR 01-30 12:49:56 [core.py:948] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.
(EngineCore_DP0 pid=1084) INFO 01-30 12:49:56 [ray_executor.py:120] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=1084) 2026-01-30 12:49:56,207       INFO compiled_dag_node.py:2167 -- Tearing down compiled DAG
(EngineCore_DP0 pid=1084) 2026-01-30 12:49:56,208       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 4345a99b5f2ce2efa753489001000000)
(EngineCore_DP0 pid=1084) 2026-01-30 12:49:56,208       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, d5c932163e63f041a849d9e001000000)
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693] AsyncLLM output_handler failed.
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693] Traceback (most recent call last):
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 649, in output_handler
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693]     outputs = await engine_core.get_output_async()
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 894, in get_output_async
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693]     raise self._format_exception(outputs) from None
(APIServer pid=1024) ERROR 01-30 12:49:56 [async_llm.py:693] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1024) INFO:     127.0.0.1:43390 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=1084) 2026-01-30 12:49:56,211       INFO compiled_dag_node.py:2194 -- Waiting for worker tasks to exit
(APIServer pid=1024) INFO:     Shutting down

I’m using this command to start the vllm cluster:

./launch-cluster.sh \
    --name eugr-vllm-cluster \
    -t harbor.k8s.wm.k8slab/dgx/eugr-vllm:latest \
    exec \
    vllm serve \
    /models/gpt-oss-120b \
    --port=8000 \
    --host=0.0.0.0 \
    --gpu-memory-utilization=0.7 \
    -tp 2 \
    --distributed-executor-backend ray \
    --load-format fastsafetensors

On startup, everything looks okay to me:

Auto-detecting interfaces...
  Detected IB_IF: rocep1s0f0,roceP2p1s0f0
  Detected ETH_IF: enp1s0f0np0
  Detected Local IP: 192.168.100.10 (192.168.100.10/31)
Auto-detecting nodes...
  Scanning for SSH peers on 192.168.100.10/31...
  Found peer: 192.168.100.11
  Cluster Nodes: 192.168.100.10,192.168.100.11
Head Node: 192.168.100.10
Worker Nodes: 192.168.100.11
Container Name: eugr-vllm-cluster
Image Name: harbor.k8s.wm.k8slab/dgx/eugr-vllm:latest
Action: exec
Checking SSH connectivity to worker nodes...
  SSH to 192.168.100.11: OK
Starting Head Node on 192.168.100.10...
3ce2e35970e368294e6865b051ed4b6cb9764918407cf8d8b50f28eb82f87a2b
Starting Worker Node on 192.168.100.11...
725b948d49d4bcf53c66397d03c002b2e44ac6b7f7d855ad7ac0e1816098302f

I do see a startup error regarding the Triton kernels, but I feel like this isn’t really related to my problem; please shout back if my intuition is wrong:

ERROR 01-30 12:28:32 [gpt_oss_triton_kernels_moe.py:34] Failed to import Triton kernels. Please make sure your triton version is compatible. Error: No module named 'triton_kernels.routing'

Regarding vllm/ray, I do see this … I’ve never run a multi-node configuration before, so to be honest, I’m not sure if this is expected behavior or indicates an issue. Like I said above, the model loads correctly on both nodes:

(EngineCore_DP0 pid=1083) 2026-01-30 12:28:32,449       INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 192.168.100.10:6379...
(EngineCore_DP0 pid=1083) 2026-01-30 12:28:32,457       INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at http://192.168.100.10:8265
(EngineCore_DP0 pid=1083) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=1083)   warnings.warn(
(EngineCore_DP0 pid=1083) INFO 01-30 12:28:32 [ray_utils.py:402] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=1083) WARNING 01-30 12:28:32 [ray_utils.py:213] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node fc107f9265520cec8ddfda10b91a8e1d174d9d3e43a38a9f6bd64b24. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_DP0 pid=1083) WARNING 01-30 12:28:32 [ray_utils.py:213] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 34518905dfd0291243ce95d4b2a9c01d855a413700a8a57d2caafc8a. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.

Part of me feels like this is related to my QSFP setup … I did follow the quick start guide exactly, with the exception of putting each RoCE link in its own subnet (in case it was a routing issue). Here is my current net config:

Master

# QSFP
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

# IPs
lo               UNKNOWN        127.0.0.1/8 ::1/128 
enP7s7           UP             10.0.1.118/24 fe80::a076:3ac2:4aaf:2728/64 
enp1s0f0np0      UP             192.168.100.10/31 
enp1s0f1np1      DOWN           
enP2p1s0f0np0    UP             192.168.101.0/31 
enP2p1s0f1np1    DOWN           
wlP9s9           UP             10.0.1.105/24 fe80::8b8b:f189:233e:1a4a/64 
docker0          UP             172.17.0.1/16 fe80::7c37:b9ff:fe69:51b3/64 
veth147f954@if2  UP             fe80::4c68:48ff:fe4e:2c77/64

# Routes
default via 10.0.1.1 dev enP7s7 proto dhcp src 10.0.1.118 metric 100 
default via 10.0.1.1 dev wlP9s9 proto dhcp src 10.0.1.105 metric 600 
10.0.1.0/24 dev enP7s7 proto kernel scope link src 10.0.1.118 metric 100 
10.0.1.0/24 dev wlP9s9 proto kernel scope link src 10.0.1.105 metric 600 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
192.168.100.10/31 dev enp1s0f0np0 proto kernel scope link src 192.168.100.10 
192.168.101.0/31 dev enP2p1s0f0np0 proto kernel scope link src 192.168.101.0

Worker

# QSFP
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

# IPs
lo               UNKNOWN        127.0.0.1/8 ::1/128 
enP7s7           UP             10.0.1.188/24 fe80::3844:a617:8197:a32d/64 
enp1s0f0np0      UP             192.168.100.11/31 
enp1s0f1np1      DOWN           
enP2p1s0f0np0    UP             192.168.101.1/31 
enP2p1s0f1np1    DOWN           
wlP9s9           UP             10.0.1.150/24 fe80::c776:4132:665d:e059/64 
docker0          UP             172.17.0.1/16 fe80::24b7:83ff:fe82:e147/64 
vethb385fc3@if2  UP             fe80::44c2:4fff:febb:ad2e/64

# Routes
default via 10.0.1.1 dev enP7s7 proto dhcp src 10.0.1.188 metric 100 
default via 10.0.1.1 dev wlP9s9 proto dhcp src 10.0.1.150 metric 600 
10.0.1.0/24 dev enP7s7 proto kernel scope link src 10.0.1.188 metric 100 
10.0.1.0/24 dev wlP9s9 proto kernel scope link src 10.0.1.150 metric 600 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
192.168.100.10/31 dev enp1s0f0np0 proto kernel scope link src 192.168.100.11 
192.168.101.0/31 dev enP2p1s0f0np0 proto kernel scope link src 192.168.101.1
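
Before digging further into vLLM/Ray, my next step is to rule out the RoCE fabric itself with a quick point-to-point bandwidth test over the 192.168.100.x link (ib_write_bw from the perftest package, device names taken from the mapping above; I haven’t run this yet and the flags may need tweaking for RoCE GID selection):

# on the worker (192.168.100.11)
ib_write_bw -d rocep1s0f0

# on the master
ib_write_bw -d rocep1s0f0 192.168.100.11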

My apologies in advance for so much information here. Again, I appreciate the help and insight, as well as the container that you’ve created. Like I said, that’s already helped me out quite a bit and saved me a bunch of time. So ty again!

How do you build the image? Are you using the new mxfp4 one for gpt-oss?

Are you using the main branch of the repo? The cluster setup was not working when I first integrated Chris’s build in my Docker, but I fixed that in main.

There have been some reports about hanging nodes with my regular build too recently - not sure what they are related to. I rolled back the Triton version and bumped up the base CUDA image - let’s see if it helps.

Good afternoon @eugr,

Previously, I was trying the container from just before your mxfp4 update.

I just did a pull/rebuild and am (more or less) seeing the same behavior … I see activity on both the master and worker GPUs; this time the master went quiet and the process died after ~5 min:

= False
(EngineCore_DP0 pid=1083) INFO 01-30 19:08:03 [ray_executor.py:601] Using RayPPCommunicator (which wraps vLLM _PP GroupCoordinator) for Ray Compiled Graph communication.
(APIServer pid=1023) INFO 01-30 19:08:12 [loggers.py:257] Engine 000: Avg prompt throughput: 7.0 tokens/s, Avg generation throughput: 8.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1023) INFO 01-30 19:08:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.0rc1.dev29+gf451b4558.d20260130) with config: model='/models/gpt-oss-120b', speculative_config=None, tokenizer='/models/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-8aaaa94d159c3a4f-b29cdfab'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[358],num_output_tokens=[201]), num_scheduled_tokens={chatcmpl-8aaaa94d159c3a4f-b29cdfab: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 23], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.00019656381874355588, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] Traceback (most recent call last):
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2525, in _execute_until
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     result = self._dag_output_fetcher.read(timeout)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 312, in read
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     outputs = self._read_list(timeout)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]               ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 403, in _read_list
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     raise e
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/common.py", line 385, in _read_list
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     result = c.read(min(remaining_timeout, iteration_timeout))
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 776, in read
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     return self._channel_dict[self._resolve_actor_id()].read(timeout)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 612, in read
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     output = self._buffers[self._next_read_index].read(timeout)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/channel/shared_memory_channel.py", line 480, in read
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     ret = self._worker.get_objects(
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]           ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 976, in get_objects
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     ] = self.core_worker.get_objects(
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "python/ray/_raylet.pyx", line 2875, in ray._raylet.CoreWorker.get_objects
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "python/ray/includes/common.pxi", line 124, in ray._raylet.check_status
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] ray.exceptions.RayChannelTimeoutError: System error: Timed out waiting for object available to read. ObjectID: 00d6d59239756144ebcd09945b34d4fdf699a2ec0100000003e1f505
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] Traceback (most recent call last):
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 939, in run_engine_core
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in run_busy_loop
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     self._process_engine_step()
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 999, in _process_engine_step
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 389, in step
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 431, in sample_tokens
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     return self._execute_dag(scheduler_output, grammar_output, non_block)
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 449, in _execute_dag
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     return refs[0].get()
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]            ^^^^^^^^^^^^^
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 115, in get
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     self._dag._execute_until(
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/ray/dag/compiled_dag_node.py", line 2535, in _execute_until
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948]     raise RayChannelTimeoutError(
(EngineCore_DP0 pid=1083) ERROR 01-30 19:13:09 [core.py:948] ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds. Otherwise, this may indicate that the execution is hanging.
(EngineCore_DP0 pid=1083) INFO 01-30 19:13:09 [ray_executor.py:120] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:09,216       INFO compiled_dag_node.py:2167 -- Tearing down compiled DAG
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:09,216       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ebcd09945b34d4fdf699a2ec01000000)
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:09,216       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 8b7e0c3ddba7c6a41365a76501000000)
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681] AsyncLLM output_handler failed.
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681] Traceback (most recent call last):
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 637, in output_handler
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681]     outputs = await engine_core.get_output_async()
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 894, in get_output_async
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681]     raise self._format_exception(outputs) from None
(APIServer pid=1023) ERROR 01-30 19:13:09 [async_llm.py:681] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1023) INFO:     127.0.0.1:37238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:09,220       INFO compiled_dag_node.py:2194 -- Waiting for worker tasks to exit
(APIServer pid=1023) INFO:     Shutting down
(APIServer pid=1023) INFO:     Waiting for application shutdown.
(APIServer pid=1023) INFO:     Application shutdown complete.
(APIServer pid=1023) INFO:     Finished server process [1023]
*** SIGTERM received at time=1769800392 on cpu 19 ***
PC: @     0xfcbdf7071ea0  (unknown)  (unknown)
    @     0xfcbca5aad178        464  absl::lts_20230802::AbslFailureSignalHandler()
    @     0xfcbdf732d968   22082384  (unknown)
    @     0xfcbdf7074e50        176  pthread_cond_timedwait
    @     0xfcbca4ff457c         96  ray::core::GetRequest::Wait()
    @     0xfcbca4ff6704       1184  ray::core::CoreWorkerMemoryStore::GetImpl()
    @     0xfcbca4ff714c       1344  ray::core::CoreWorkerMemoryStore::Get()
    @     0xfcbca4ff73a0         32  ray::core::CoreWorkerMemoryStore::Get()
    @     0xfcbca4e9bee8        208  ray::core::CoreWorker::GetObjects()
    @     0xfcbca4e9c6e4       1472  ray::core::CoreWorker::Get()
    @     0xfcbca4deb528        176  __pyx_pw_3ray_7_raylet_10CoreWorker_39get_objects()
    @           0x4c4a0c        224  PyObject_Vectorcall
    @   0x22000000564494         32  (unknown)
    @   0x550000005627e4        368  (unknown)
    @   0x5500000059b800        128  (unknown)
    @   0x7c00000067e6a4         80  (unknown)
    @   0x6c00000068ae60         32  (unknown)
    @   0x1f00000068a968        272  (unknown)
    @   0x48fcbdf70184c4        304  (unknown)
    @     0xfcbdf7018598         16  __libc_start_main
[2026-01-30 19:13:12,824 E 1083 1083] logging.cc:474: *** SIGTERM received at time=1769800392 on cpu 19 ***
[2026-01-30 19:13:12,824 E 1083 1083] logging.cc:474: PC: @     0xfcbdf7071ea0  (unknown)  (unknown)
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca5aad1a0        464  absl::lts_20230802::AbslFailureSignalHandler()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbdf732d968   22082384  (unknown)
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbdf7074e50        176  pthread_cond_timedwait
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4ff457c         96  ray::core::GetRequest::Wait()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4ff6704       1184  ray::core::CoreWorkerMemoryStore::GetImpl()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4ff714c       1344  ray::core::CoreWorkerMemoryStore::Get()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4ff73a0         32  ray::core::CoreWorkerMemoryStore::Get()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4e9bee8        208  ray::core::CoreWorker::GetObjects()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4e9c6e4       1472  ray::core::CoreWorker::Get()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @     0xfcbca4deb528        176  __pyx_pw_3ray_7_raylet_10CoreWorker_39get_objects()
[2026-01-30 19:13:12,826 E 1083 1083] logging.cc:474:     @           0x4c4a0c        224  PyObject_Vectorcall
[2026-01-30 19:13:12,827 E 1083 1083] logging.cc:474:     @   0x22000000564494         32  (unknown)
[2026-01-30 19:13:12,829 E 1083 1083] logging.cc:474:     @   0x550000005627e4        368  (unknown)
[2026-01-30 19:13:12,830 E 1083 1083] logging.cc:474:     @   0x5500000059b800        128  (unknown)
[2026-01-30 19:13:12,831 E 1083 1083] logging.cc:474:     @   0x7c00000067e6a4         80  (unknown)
[2026-01-30 19:13:12,832 E 1083 1083] logging.cc:474:     @   0x6c00000068ae60         32  (unknown)
[2026-01-30 19:13:12,833 E 1083 1083] logging.cc:474:     @   0x1f00000068a968        272  (unknown)
[2026-01-30 19:13:12,834 E 1083 1083] logging.cc:474:     @   0x48fcbdf70184c4        304  (unknown)
[2026-01-30 19:13:12,834 E 1083 1083] logging.cc:474:     @     0xfcbdf7018598         16  __libc_start_main
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:13,221       INFO compiled_dag_node.py:2167 -- Tearing down compiled DAG
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:13,221       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ebcd09945b34d4fdf699a2ec01000000)
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:13,221       INFO compiled_dag_node.py:2172 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 8b7e0c3ddba7c6a41365a76501000000)
(EngineCore_DP0 pid=1083) 2026-01-30 19:13:13,225       INFO compiled_dag_node.py:2194 -- Waiting for worker tasks to exit

I do want to say that the Triton error is gone … the only thing that looks suspicious is that I still have this in my logs (I have no idea whether or not this is normal):

(EngineCore_DP0 pid=1083) WARNING 01-30 19:05:46 [ray_utils.py:337] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.

I’m going to give the mxfp4 version a try now, will report back.