Run vLLM on Spark

Can I run both oss-20B and oss-120B concurrently? I'm currently using TensorRT for simultaneous execution, but I'm seeing significant performance degradation.

You would have to launch two instances of vLLM on different ports. Would that work for you?

You can run two separate instances of vLLM on different ports, as @PrinceHal suggested. Just make sure you set --gpu-memory-utilization accordingly, so that together they use less than 0.9 of total VRAM (e.g. 0.2 + 0.7).
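As a rough sketch of the two-instance setup: the model IDs below (openai/gpt-oss-20b, openai/gpt-oss-120b) and the 0.2/0.7 memory split are illustrative assumptions, so adjust them for your models and hardware.

```shell
# Smaller model: smaller VRAM share, served on port 8000
vllm serve openai/gpt-oss-20b \
  --port 8000 \
  --gpu-memory-utilization 0.2

# Larger model: larger VRAM share, served on port 8001
vllm serve openai/gpt-oss-120b \
  --port 8001 \
  --gpu-memory-utilization 0.7
```

Each instance then exposes its own OpenAI-compatible endpoint (http://localhost:8000/v1 and http://localhost:8001/v1), and the two fractions should sum to less than the default 0.9 cap so neither instance starves the other.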

If you want to access both on a single port, or load/unload them on demand, you can try llama-swap or its fork llmsnap, which adds some vLLM-specific features.
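A minimal llama-swap-style config might look like the sketch below. This is written from memory of llama-swap's README (the `models:` map, `cmd:` entries, and the `${PORT}` placeholder that llama-swap substitutes at launch), and the model IDs are assumptions, so check the project's documentation before relying on it.

```yaml
# llama-swap config sketch: one upstream command per model,
# swapped in and out on demand behind a single proxy port
models:
  "gpt-oss-20b":
    cmd: vllm serve openai/gpt-oss-20b --port ${PORT}
  "gpt-oss-120b":
    cmd: vllm serve openai/gpt-oss-120b --port ${PORT}
```

With this approach only one model occupies VRAM at a time, so each instance can keep the default --gpu-memory-utilization instead of splitting it.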

Not sure if it would be helpful for you, but you could give this a shot for improving performance:

https://forums.developer.nvidia.com/t/vllm-on-gb10-gpt-oss-120b-mxfp4-slower-than-sglang-llama-cpp-what-s-missing/