There have been numerous posts about it. This one, for example. You can use my repo or Mark’s. His is a bit more plug-and-play, I believe, but it hides some parameters. Mine offers more flexibility but may not be as beginner-friendly. Also, my repo builds from the main branch, whereas Mark’s solution uses the official vLLM Docker image from NVIDIA (which lags behind in model support and performance).