Can we use NIM TensorRT-LLM on H100 NVL?
I checked the model profiles of nvcr.io/nim/meta/llama-3.1-8b-instruct:1.2.2 on an H100 NVL system, and it seems none of the TensorRT-LLM profiles support the H100 NVL.
SYSTEM INFO
- Free GPUs:
  - [2321:10de] (0) NVIDIA H100 NVL [current utilization: 0%]
  - [2321:10de] (1) NVIDIA H100 NVL [current utilization: 0%]
MODEL PROFILES
- Compatible with system and runnable:
  - 6a3ba475d3215ca28f1a8c8886ab4a56b5626d1c98adbfe751025e8ff3d9886d (vllm-bf16-tp2)
  - 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1)
  - With LoRA support:
    - a95e5c7221dae587b4fc32448df265320ce79064a970297649d97a84eb9dc3ba (vllm-bf16-tp2-lora)
    - dfd9bee71abb7582246f7fb8c2aedd9119909b9639e1b4b0260ef6865545ede7 (vllm-bf16-tp1-lora)
- Incompatible with system:
  - 0bc4cc784e55d0a88277f5d1aeab9f6ecb756b9049dd07c1835035211fcfe77e (tensorrt_llm-h100-fp8-tp2-latency)
  - 2959f7f0dfeb14631352967402c282e904ff33e1d1fa015f603d9890cf92ca0f (tensorrt_llm-h100-fp8-tp1-throughput)
  - e45b4b991bbc51d0df3ce53e87060fc3a7f76555406ed534a8479c6faa706987 (tensorrt_llm-a10g-bf16-tp4-latency)
  - 7781e7219b2f41c3e560bb90d6f357ce64bbed9203f087d2cfe0ea3a523a04b7 (tensorrt_llm-a100-bf16-tp2-latency)
  - 7f98797c334a8b7205d4cbf986558a2b8a181570b46abed9401f7da6d236955e (tensorrt_llm-h100-bf16-tp2-latency)
  - 0494aafce0df9eeaea49bbca6b25fc3013d0e8a752ebcf191a2ddeaab19481ee (tensorrt_llm-l40s-bf16-tp2-latency)
  - ba515cc44a34ae4db8fe375bd7e5ad30e9a760bd032230827d8a54835a69c409 (tensorrt_llm-a10g-bf16-tp2-throughput)
  - a534b0f5e885d747e819fa8b1ad7dc1396f935425a6e0539cb29b0e0ecf1e669 (tensorrt_llm-l40s-bf16-tp2-throughput)
  - 7ea3369b85d7aee24e0739df829da8832b6873803d5f5aca490edad7360830c8 (tensorrt_llm-a100-bf16-tp1-throughput)
  - 9cff0915527166b2e93c08907afd4f74e168562992034a51db00df802e86518c (tensorrt_llm-h100-bf16-tp1-throughput)
  - 3807be802a8ab1d999bf280c96dcd8cf77ac44c0a4d72edb9083f0abb89b6a19 (tensorrt_llm-l40s-bf16-tp1-throughput)
  - 407c6c5d1e29be9929f41b9a2e3193359b8ebfa512353de88cefbf1e0f0b194e (vllm-bf16-tp4)
  - 6b89dc22ba60a07df3051451b7dc4ef418d205e52e19cb0845366dc18dd61bd6 (tensorrt_llm-l40s-bf16-tp2-throughput-lora)
  - a506c5bed39ba002797d472eb619ef79b1ffdf8fb96bb54e2ff24d5fc421e196 (tensorrt_llm-a100-bf16-tp1-throughput-lora)
  - 40543df47628989c7ef5b16b33bd1f55165dddeb608bf3ccb56cdbb496ba31b0 (tensorrt_llm-h100-bf16-tp1-throughput-lora)
  - 678e6dbe53dd6fe7dc508d22eb0672743ba0b7e735007cfd0b0d2a9e05911fb9 (vllm-bf16-tp4-lora)
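If I understand the profile selection correctly, NIM compares the detected GPU's PCI device ID (here 2321:10de for the H100 NVL) against the gpu_device tag baked into each prebuilt TensorRT-LLM profile, while the vLLM profiles carry no such pin. A minimal sketch of that matching behavior, as I understand it (the function and dicts below are my own illustration, not NIM's actual code):

```python
# Sketch of how NIM's profile compatibility check appears to behave.
# The data is taken from the log output above; the matching function
# itself is an assumption, not NIM's real implementation.

HOST_GPU_DEVICE = "2321:10de"  # NVIDIA H100 NVL, from SYSTEM INFO

# Two representative profiles, abbreviated from the manifest.
PROFILES = {
    "tensorrt_llm-h100-fp8-tp1-throughput": {
        "llm_engine": "tensorrt_llm",
        "gpu_device": "2330:10de",  # pinned to H100 80GB HBM3
    },
    "vllm-bf16-tp1": {
        "llm_engine": "vllm",  # no gpu_device pin
    },
}

def is_compatible(profile: dict, host_device: str) -> bool:
    """A profile pinned to a gpu_device only runs on that exact PCI ID;
    an unpinned (vLLM) profile runs on any supported GPU."""
    pinned = profile.get("gpu_device")
    return pinned is None or pinned == host_device

for name, tags in PROFILES.items():
    verdict = "compatible" if is_compatible(tags, HOST_GPU_DEVICE) else "incompatible"
    print(f"{name}: {verdict}")
```

This would explain the split above: every tensorrt_llm profile is pinned to a specific device ID, and none of them is pinned to 2321:10de.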
I also checked /etc/nim/config/model_manifest.yaml. It seems the TensorRT-LLM profiles only support gpu_device 2330:10de, which is the NVIDIA H100 80GB HBM3. For example, the incompatible fp8 throughput profile is tagged as follows:
2959f7f0dfeb14631352967402c282e904ff33e1d1fa015f603d9890cf92ca0f:
  container_url: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.2.0
  model: meta/llama-3.1-8b-instruct
  release: 1.2.2
  tags:
    feat_lora: false
    gpu: H100
    gpu_device: 2330:10de
    llm_engine: tensorrt_llm
    pp: '1'
    precision: fp8
    profile: throughput
    tp: '1'
  workspace: !workspace
    components:
    - dst: ''
      src:
        files:
        - !name 'LICENSE.txt'
        - !name 'NOTICE.txt'
        - !name 'checksums.blake3'
        - !name 'config.json'
        - !name 'generation_config.json'
        - !name 'model.safetensors.index.json'
        - !name 'special_tokens_map.json'
        - !name 'tokenizer.json'
        - !name 'tokenizer_config.json'
        - !name 'tool_use_config.json'
        repo_id: ngc://nim/meta/llama-3_1-8b-instruct:hf-8c22764-nim1.2
    - dst: trtllm_engine
      src:
        files:
        - !name 'LICENSE.txt'
        - !name 'NOTICE.txt'
        - !name 'checksums.blake3'
        - !name 'config.json'
        - !name 'metadata.json'
        - !name 'rank0.engine'
        repo_id: ngc://nim/meta/llama-3_1-8b-instruct:0.11.1+14957bf8-h100x1-fp8-throughput.1.2.18099815
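To double-check which device IDs the prebuilt engines are pinned to, the whole manifest can be scanned for gpu_device tags. A rough sketch (a regex scan rather than yaml.safe_load, since the custom !workspace/!name tags are not plain YAML; on a live system the text would come from the /etc/nim/config/model_manifest.yaml path above):

```python
import re
from collections import Counter

# Pull out every "gpu_device: <vendor:device>" pin in the manifest text.
GPU_DEVICE_RE = re.compile(r"gpu_device:\s*(\S+)")

def gpu_device_pins(manifest_text: str) -> Counter:
    """Count how often each PCI device ID appears as a gpu_device tag."""
    return Counter(GPU_DEVICE_RE.findall(manifest_text))

# Demo against a fragment of the profile quoted above:
sample = """
tags:
  feat_lora: false
  gpu: H100
  gpu_device: 2330:10de
  llm_engine: tensorrt_llm
"""
print(gpu_device_pins(sample))
```

If 2321:10de never shows up in the output, then no prebuilt TensorRT-LLM engine in this release targets the H100 NVL, which would match the "Incompatible with system" list above.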
Is there any way to use the NIM TensorRT-LLM profiles on H100 NVL, or do we have to fall back to the vLLM profiles?