Tune and Deploy LoRA LLMs with NVIDIA TensorRT-LLM

Originally published at: https://developer.nvidia.com/blog/tune-and-deploy-lora-llms-with-nvidia-tensorrt-llm/

Large language models (LLMs) have revolutionized natural language processing (NLP) with their ability to learn from massive amounts of text and generate fluent and coherent texts for various tasks and domains. However, customizing LLMs is a challenging task, often requiring a full training process that is time-consuming and computationally expensive. Moreover, training LLMs requires a…

Hello, I’m having problems following the tutorial: Deploying LoRA tuned models with Triton and inflight batching.

I built the image: make -C docker release_build
Then I ran the container: docker run --gpus all --shm-size=2g -p 8000:8000 --ulimit memlock=-1 --rm -it c97ad4afed1e bash

I have not been able to find the script tensorrt_llm/examples/llama/build.py.
I found only this one: /app/tensorrt_llm/examples/enc_dec/build.py, but it gives me the following error:
build.py: error: unrecognized arguments: --model_dir --max_input_len 512 --lora_target_modules attn_q attn_k attn_v --use_inflight_batching --paged_kv_cache --max_lora_rank 8

Running this script instead: /usr/local/lib/python3.10/dist-packages/ammo/deploy/llm/tensorrt_llm_build.py only prints the TensorRT-LLM version: [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040900

Where can I find the build.py script?