Setup: RTX A5000, Docker image built following the official TensorRT-LLM installation guidelines, running the supported InternVL2-8B model version.

Generating 4 tokens (measured with the tensorrt-llm profiling option) takes 297 ms end-to-end with session='cpp_llm_only', but only 166 ms end-to-end with session='python'. I would expect the C++ session to be at least as fast as the Python one, so this result is very odd.
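For reference, the two measurements come from invocations along these lines. The directory paths are placeholders for my local setup, and I'm assuming `--session` and `--run_profiling` refer to the options in `examples/multimodal/run.py`:

```bash
# C++ LLM runner: ~297 ms e2e for 4 generated tokens
python3 examples/multimodal/run.py \
    --hf_model_dir ./InternVL2-8B \
    --visual_engine_dir ./engines/internvl2/vision \
    --llm_engine_dir ./engines/internvl2/llm \
    --max_new_tokens 4 \
    --session cpp_llm_only \
    --run_profiling

# Python runner: ~166 ms e2e for the same 4 tokens
python3 examples/multimodal/run.py \
    --hf_model_dir ./InternVL2-8B \
    --visual_engine_dir ./engines/internvl2/vision \
    --llm_engine_dir ./engines/internvl2/llm \
    --max_new_tokens 4 \
    --session python \
    --run_profiling
```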
Would appreciate your help.