Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @dhiaulm - it looks like you downloaded the wrong checkpoint file, as described here. Can you please try downloading the HF model checkpoint again by cloning it correctly?
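
For reference, a clean re-download usually looks like this (a sketch, assuming the gated Meta-Llama-3-8B-Instruct repo and that you have access to it; if git-lfs is not installed, the clone fetches only small pointer files instead of the real weights, which is a common cause of a broken checkpoint):

# Make sure large files are actually downloaded, not just LFS pointers
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct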

Hi @deepikasv1703 - It looks like the GPU you’re running on is not supported. Can you confirm which GPU? Also, take a look at this solution if you are using a V100 or another GPU that doesn’t support the FMHA kernel.
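
On GPUs without FMHA support, the fused attention kernel can typically be turned off when building the engine. A minimal sketch, assuming a recent trtllm-build CLI (the exact flag name can vary across TensorRT-LLM versions, and the checkpoint/engine paths here are placeholders):

# Build the engine with the fused multi-head attention kernel disabled
trtllm-build --checkpoint_dir ./llama3_checkpoint \
             --output_dir ./llama3_engine \
             --context_fmha disable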

Hi @anjshah - I tried what you suggested, but the V100 is still not supported. Could you help me run the Docker runtime on Colab?

!docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

This command is not working on Colab; it can’t connect to the GPU.

Hi @deepikasv1703 - did you follow these steps to get Docker running in Colab?
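
For anyone hitting the same wall: a standard Colab runtime does not provide a Docker daemon, so docker run fails there regardless of the GPU. One common workaround is to skip the container and install TensorRT-LLM directly with pip. A sketch, assuming a recent wheel is available on NVIDIA’s PyPI index and the Colab image ships a compatible CUDA runtime:

# Confirm the GPU is visible to the runtime first
!nvidia-smi
# Install TensorRT-LLM from NVIDIA's package index
!pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com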