Triton CUDA error: out of memory

I tried to deploy a Triton (version 23.07) ensemble model that uses the Python backend with a custom fine-tuned Llama 2 model,

and I am getting:
I0815 16:40:55.662153 1] Failed to initialize Python stub: RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

/usr/local/lib/python3.10/dist-packages/accelerate/utils/ get_max_memory
/usr/local/lib/python3.10/dist-packages/accelerate/utils/ get_balanced_memory
/usr/local/lib/python3.10/dist-packages/transformers/ from_pretrained
/models/fllama/1/ initialize

Hi @itamar6 ,
Can you please try adding /usr/local/lib/python3.10/dist-packages/ to LD_LIBRARY_PATH and try again?
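
If that doesn't resolve it: since the stack trace shows the OOM being raised from accelerate's get_max_memory / get_balanced_memory inside from_pretrained, another thing worth trying is capping per-GPU memory explicitly via the max_memory argument, so the loader leaves headroom for the CUDA context and Triton's own allocations. A minimal sketch of building such a cap (the GPU sizes, fraction, and helper name here are illustrative, not from your setup — in practice you'd query torch.cuda for the real totals and pass the result to from_pretrained):

```python
# Hypothetical helper: build a max_memory mapping suitable for
# transformers' from_pretrained(..., device_map="auto", max_memory=...).
def build_max_memory(gpu_total_gib, fraction=0.85, cpu="30GiB"):
    """Cap each GPU at `fraction` of its total memory, leaving headroom
    for the CUDA context and Triton's own allocations.

    gpu_total_gib: list of per-GPU total memory in GiB (illustrative values).
    """
    max_memory = {
        idx: f"{int(total * fraction)}GiB"
        for idx, total in enumerate(gpu_total_gib)
    }
    max_memory["cpu"] = cpu  # allow weight offload to host RAM
    return max_memory

# Example: two 24 GiB GPUs, capped at 85%
print(build_max_memory([24, 24]))  # → {0: '20GiB', 1: '20GiB', 'cpu': '30GiB'}
```

You would then pass the resulting dict as `max_memory=` to `from_pretrained` in your model's `initialize`; loading in `torch_dtype=torch.float16` also roughly halves the footprint versus fp32, if you aren't already.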