I tried to deploy the default NVIDIA Blueprint for RAG in a limited-GPU environment. The online user guide says we need to apply MIG with 9x20GB and 1x80GB slices. Has anyone tried this? The Llama 3.2 embedding and reranking models seem to need over 20GB in some tests and fail, while the other services run fine in as little as 5GB. Finally, the ingestion server requests 26 CPUs in its resources; can this be overridden?
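For the CPU request question: Helm chart resource requests can generally be overridden with a custom values file passed via `-f`. A minimal sketch of such an override, assuming the ingestion server's subchart and value paths look like the standard Kubernetes `resources` block (the exact key names here, e.g. `ingestor-server`, are assumptions and depend on the chart version you deployed):

```yaml
# values-override.yaml (hypothetical key paths -- check your chart's values.yaml)
ingestor-server:
  resources:
    requests:
      cpu: "8"        # lower the default 26-CPU request
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
```

You would then apply it with something like `helm upgrade --install <release> <chart> -f values-override.yaml`. Note that lowering a request below what the service actually needs can cause throttling or OOM kills, so this is a sketch for constrained environments, not a tuned recommendation.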
Hi Tony,
We have not tested this yet using limited resources.
You may be able to use smaller models that run on more limited hardware to accomplish this.
If we want to deploy other LLM models, how do we override this in the Helm charts? Or could we point the blueprint at an existing small LLM that is already deployed? Will the frontend app (playground) be able to recognize the new LLM?
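In charts like this, swapping the model usually comes down to overriding the model name and/or endpoint in a values file rather than editing the chart itself. A minimal sketch, assuming the chart exposes the model and an external-endpoint URL as values (the key names `nim-llm`, `model`, and `llmEndpoint` below are hypothetical and must be checked against the chart's `values.yaml`):

```yaml
# values-override.yaml (hypothetical keys -- verify against the actual chart)
# Option A: have the bundled NIM serve a smaller model
nim-llm:
  model: "meta/llama-3.2-1b-instruct"

# Option B: skip the bundled LLM and point at an existing deployment
llmEndpoint: "http://my-existing-llm.default.svc.cluster.local:8000/v1"
```

Whether the playground picks up the new model typically depends on whether it queries the serving endpoint for available models at runtime or has the model name baked into its own config; if the latter, you would need to override that value as well.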