Error when trying to compile a Llama 3 checkpoint using trtllm-build

Description

I’m trying to compile a Llama 3 8B-Instruct model with TensorRT-LLM, and the following error occurs when I run the trtllm-build command:

RuntimeError: Unexpected error from cudaGetDeviceCount(). 
Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? 
Error 500: named symbol not found
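
A quick way to check whether the failure is specific to trtllm-build or affects CUDA initialization in general is to query the GPU from inside the container (my own diagnostic commands, not from the tutorial):

    # check whether the driver is visible inside the container
    nvidia-smi
    # check whether PyTorch can initialize CUDA (roughly the same call path trtllm-build goes through)
    python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"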

Environment

TensorRT-LLM Version: 0.8.0
GPU Type: NVIDIA RTX 4060
NVIDIA Driver Version: 555.85
CUDA Version: 12.5
CUDNN Version: 12.1.105
Operating System + Version: Windows 11 Pro (host) - Ubuntu 22.04 (container)
Python Version: 3.10
PyTorch Version: 2.1.2+cu121
Baremetal or Container: nvidia/cuda:12.1.0-devel-ubuntu22.04
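
For reference, a container like this is typically started with GPU passthrough enabled, along the lines of (approximate; not necessarily the exact command I used):

    docker run --rm -it --gpus all nvidia/cuda:12.1.0-devel-ubuntu22.04 bash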

Steps To Reproduce

I am following the tutorial Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server from the NVIDIA blog.
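
The build command I ran follows the tutorial’s pattern; the checkpoint and output directories below are placeholders for my local paths:

    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
                 --output_dir ./llama3-8b-instruct-engine \
                 --gpt_attention_plugin bfloat16 \
                 --gemm_plugin bfloat16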

Complete execution Log:

[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[05/29/2024-22:30:32] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set lookup_plugin to None.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set lora_plugin to None.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set context_fmha to True.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set remove_input_padding to True.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set multi_block_mode to False.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set enable_xqa to True.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/29/2024-22:30:32] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/29/2024-22:30:32] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    torch.cuda.set_device(gpu_id)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 500: named symbol not found

Hi @lerrana,
This error is usually caused by an NVIDIA NVML driver/library version mismatch.
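
One quick way to confirm a mismatch (optional; these are generic Linux driver commands, not specific to TensorRT-LLM) is to compare the loaded kernel module version with what the user-space tools report:

    cat /proc/driver/nvidia/version    # version of the loaded kernel module
    nvidia-smi                         # reports the user-space driver/NVML version, or an explicit mismatch error

If the versions disagree, or nvidia-smi itself reports a mismatch, unloading and reloading the driver modules usually clears it. The individual steps are below, with the full command sequence consolidated after the list.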

  1. In a terminal, run: lsmod | grep nvidia.

  2. Then unload the modules that depend on the nvidia driver:

    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    
  3. Finally, unload the nvidia module: sudo rmmod nvidia.

  4. Now when you run lsmod | grep nvidia again, you should get no output.

  5. Now run nvidia-smi to check if you get the desired output.
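
Put together, the check-and-unload sequence from the steps above looks like this (a sketch; run it from a text console or SSH session, since unloading nvidia_drm may end an active graphical session):

    lsmod | grep nvidia       # see which nvidia modules are loaded
    sudo rmmod nvidia_drm     # unload the modules that depend on the driver
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia         # unload the driver module itself
    lsmod | grep nvidia       # should now print nothing
    nvidia-smi                # should reload the driver and list the GPU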

Thanks