Hello Everyone,
We are developing an AI assistant using the Mistral 7B model. Currently, inference runs on a single NVIDIA A40 GPU.
We want to know the best practices for deploying an on-premise LLM solution using a GPU.
Specifically, we would like to know:
1. Is it possible to productionize an LLM solution on a single GPU?
2. How do we effectively manage GPU memory? With each request, GPU memory usage keeps increasing. Should it be cleared based on some condition? If yes, on what condition(s), and how can we clear it without a significant increase in latency?
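To illustrate the kind of condition-based clearing we have in mind, here is a minimal sketch assuming PyTorch on a CUDA device (the function names and the 90% threshold are our own placeholders, not a recommendation):

```python
def should_clear_cache(used_bytes: int, total_bytes: int, threshold: float = 0.9) -> bool:
    """Decide whether to release cached GPU memory.

    Returns True when used memory exceeds `threshold` (a fraction of
    total device memory). The 0.9 default is an arbitrary placeholder.
    """
    return used_bytes / total_bytes >= threshold


def maybe_clear_gpu_cache(threshold: float = 0.9) -> bool:
    """Check device memory and, if above the threshold, release
    PyTorch's cached (but unused) allocator blocks back to the driver."""
    import torch  # imported lazily so the decision logic stays testable without a GPU

    free_bytes, total_bytes = torch.cuda.mem_get_info()
    if should_clear_cache(total_bytes - free_bytes, total_bytes, threshold):
        # empty_cache() frees cached blocks; it does not free live tensors,
        # and it can add latency on the next allocation, so we only call it
        # when pressure is high rather than after every request.
        torch.cuda.empty_cache()
        return True
    return False
```

In our current setup we would call `maybe_clear_gpu_cache()` between requests; whether this is the right condition, and whether the latency cost is acceptable, is exactly what we are unsure about.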
These are our current challenges; please also let us know what other considerations we should be aware of.
If there are any checklists/guidelines we can refer to for on-premise LLM deployment, that would be much appreciated.
Thanks,
Jason