Triton CUDA error: out of memory

I tried to deploy a Triton (version 23.07) ensemble model with a Python backend and a custom fine-tuned Llama 2 model,

and I am getting:
I0815 16:40:55.662153 1 pb_stub.cc:324] Failed to initialize Python stub: RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

At:
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py(624): get_max_memory
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py(731): get_balanced_memory
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py(2824): from_pretrained
/models/fllama/1/model.py(34): initialize
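Following the hint in the error message, setting CUDA_LAUNCH_BLOCKING=1 before starting the server makes CUDA calls synchronous, so the reported stack trace points at the call that actually failed rather than a later one. A minimal sketch (how you inject the variable into your Triton container, e.g. via `docker run -e`, depends on your deployment):

```shell
# Force synchronous CUDA kernel launches so errors are reported
# at the call site instead of asynchronously later.
export CUDA_LAUNCH_BLOCKING=1
echo "$CUDA_LAUNCH_BLOCKING"
```

When launching Triton in a container, the same variable can be passed with `-e CUDA_LAUNCH_BLOCKING=1` on the `docker run` command line.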

Hi @itamar6 ,
Could you try adding /usr/local/lib/python3.10/dist-packages/ to LD_LIBRARY_PATH and then retry?
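For reference, one way to apply that suggestion (the directory is taken from the reply above; prepending preserves any existing value):

```shell
# Prepend the suggested dist-packages directory to LD_LIBRARY_PATH,
# keeping whatever was already there.
export LD_LIBRARY_PATH="/usr/local/lib/python3.10/dist-packages${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Set this in the environment of the Triton process (or the container) before it starts, so the Python stub inherits it.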