EngineCore failure or memory profiling issues when launching gemma 4-26B-A4B on two Sparks

Hi, I get an EngineCore failure when I try to launch gemma 4-26B-A4B on two nodes. The two main errors in the log are "EngineCore failed to start." and ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method()

After that I tried to fix the issue by explicitly disabling the unstable, experimental V1 engine to prevent the crashes, forcing standard Ethernet communication to get past the "13.59 GiB" memory profiling deadlock, and using FP8 quantization to optimize weight distribution and KV cache performance. But apparently the V1 engine was still being used, and in turn I keep getting stuck at "INFO 04-10 06:49:29 [gpu_model_runner.py:4827] Model loading took 13.59 GiB memory and 31.074518 seconds". As far as I understand, this is a deadlock that occurs during the memory profiling phase of vLLM.
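
A minimal sketch of how these three workarounds are commonly expressed as settings. Everything here is an assumption for illustration and not the exact command used in this setup: the model id is a placeholder, a tensor parallel size of 2 assumes one GPU per node, VLLM_USE_V1 may be ignored by vLLM releases that only ship the V1 engine, and NCCL_IB_DISABLE only makes NCCL fall back to plain sockets.

```python
import shlex


def build_launch(model_id):
    """Return (env, argv) for a hypothetical two-node vLLM launch."""
    env = {
        # Request the legacy engine; newer vLLM releases may ignore this.
        "VLLM_USE_V1": "0",
        # Disable the InfiniBand/RoCE transport so NCCL uses TCP sockets.
        "NCCL_IB_DISABLE": "1",
    }
    argv = [
        "vllm", "serve", model_id,
        # Assumption: one GPU per node, two nodes.
        "--tensor-parallel-size", "2",
        "--distributed-executor-backend", "ray",
        # FP8 for the weights and for the KV cache.
        "--quantization", "fp8",
        "--kv-cache-dtype", "fp8",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch("<model-id>")  # placeholder, not a real id
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    print(prefix, shlex.join(argv))
```

Printing the command instead of running it makes it easy to compare against what the container's launch script actually passes to vLLM.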

I used the spark-vllm-docker from @eugr as my inference engine.

Is there any way to fix this issue? Below is the full error log for my EngineCore crash:
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] EngineCore failed to start.
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1084, in run_engine_core
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 850, in __init__
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] super().__init__(
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] self.model_executor = executor_class(vllm_config)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] self._init_executor()
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 86, in _init_executor
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] self._init_workers_ray(placement_group)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 379, in _init_workers_ray
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] self.collective_rpc("load_model")
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 506, in collective_rpc
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] return fn(*args, **kwargs)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] return func(*args, **kwargs)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2981, in get
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] values, debugger_breakpoint = worker.get_objects(
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1012, in get_objects
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] raise value.as_instanceof_cause()
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.execute_method() (pid=807, ip=169.254.15.36, actor_id=ea309c1583503b202e378b5f01000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0xfa2ca1453500>)
(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
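
The part of the log pasted above stops at the RayTaskError line, so the actual ValueError message from the worker is not visible here. A small sketch for pulling the exception lines out of a saved log, assuming every line carries the same "(process) LEVEL date time [file:line]" prefix as above; the helper names and the log file argument are hypothetical.

```python
import re
import sys

# Matches the log prefix seen above, e.g.
# "(EngineCore pid=606) ERROR 04-10 08:32:47 [core.py:1110] <message>"
PREFIX = re.compile(
    r"^\((?P<proc>[^)]+)\)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<stamp>\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<source>[^\]]+)\]\s?(?P<message>.*)$"
)

# A message that starts with an exception class name, optionally with a
# wrapped cause in parentheses, e.g. "RayTaskError(ValueError): ...".
EXCEPTION = re.compile(r"^[\w.]+(?:Error|Exception)(?:\([\w.]+\))?(?::|$)")


def strip_prefix(line):
    """Return the message part of a log line, or the line itself if no prefix."""
    text = line.rstrip("\n")
    match = PREFIX.match(text)
    return match.group("message") if match else text


def exception_lines(lines):
    """Collect every message that looks like 'SomeError: details'."""
    found = []
    for line in lines:
        message = strip_prefix(line).strip()
        if EXCEPTION.match(message):
            found.append(message)
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_errors.py <path-to-saved-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for message in exception_lines(handle):
            print(message)
```

Running it over the complete log from both nodes should surface the worker-side ValueError text, which is usually the line that explains why load_model failed.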