Batch processing using NVIDIA NIM | Docker | Self-hosted

From the documentation, we can run the NIM as follows:

export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3-8b-instruct:latest

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
prompt = "Once upon a time"
response = client.completions.create(
    model="meta/llama3-8b-instruct",
    prompt=prompt,
    max_tokens=16,
    stream=False
)
completion = response.choices[0].text
print(completion)

Does it support batch inference?

Hi @mohammed.innat – NIM doesn’t support sending batched inference requests, so you’ll need to send prompts one at a time. The server might batch multiple requests together to process things more efficiently, but there’s no explicit control over that.
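
For what it's worth, that pattern is just a plain loop over prompts against the same OpenAI-compatible endpoint. A minimal sketch, reusing the endpoint and model name from the snippet above (the prompts and max_tokens are placeholders):

from openai import OpenAI

# Same local NIM endpoint as above; the API key is not used by a self-hosted NIM.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompts = ["what is tensorflow?", "what is torch?", "what is jax?"]

completions = []
for p in prompts:
    # One request per prompt; the server may still batch concurrent requests internally.
    response = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[{"role": "user", "content": p}],
        max_tokens=64,
    )
    completions.append(response.choices[0].message.content)

print(completions)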

Could you please elaborate on this? What is the best and safest way to send multiple requests to NIM?

Say I have 5 GPUs and I want to leverage all of them during inference. The following is my current setup, but since NIM doesn’t support batch processing (Hugging Face does), I think this implementation has a limitation for batch processing.

import os
import torch
import time
import torch.distributed as dist

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

prompts = [
    "what is tensorflow?",
    "what is torch?",
    "what is jax?",
    "what is keras 3",
     "what is mlx",
]
messages = [
    {"role": "user", "content": prompts[0]},
    {"role": "user", "content": prompts[1]},
    {"role": "user", "content": prompts[2]},
]

def infer_with_device(prompts, model, max_tokens):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    batch_size_per_gpu = len(prompts) // world_size
    start_idx = rank * batch_size_per_gpu
    end_idx = (rank + 1) * batch_size_per_gpu if rank != world_size - 1 else len(prompts)
    batch = prompts[start_idx:end_idx]

    print(f"Rank {rank} processing batch: {batch}")

    with torch.cuda.device(rank):
        start_time = time.time()
        # torch.cuda.synchronize() # cause OOM
        # The chat endpoint takes one conversation per request, so send the
        # prompts in this shard one at a time instead of passing raw strings.
        results = []
        for p in batch:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": p}],
                max_tokens=max_tokens,
                stream=False
            )
            results.append(response.choices[0].message.content)
        inference_time = time.time() - start_time

    print(f"Rank {rank} inference time: {inference_time:.4f} seconds")
    return results


def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    rank = dist.get_rank()

    # inference
    results = infer_with_device(
        prompts,
        model="meta/llama3-8b-instruct",
        max_tokens=2048,
    )
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Also, while running it with the following command, how do I enable LoRA or some other config? The log message says:

INFO ngc_injector.py:147] Profile metadata: feat_lora: false
INFO ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO ngc_injector.py:147] Profile metadata: precision: fp16
INFO ngc_injector.py:147] Profile metadata: tp: 2

I have 3 GPUs (48 GB each), and sometimes it causes an OOM error.

  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/activation.py", line 34, in forward
    out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 27.38 MiB is free. Process 3255 has 430.89 MiB memory in use. Process 5546 has 46.71 GiB memory in use. Of the allocated memory 40.85 GiB is allocated by PyTorch, and 84.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I tried meta/llama-3.1-8b-instruct, which uses the TensorRT-LLM engine as its backend. With it I was able to run without any out-of-memory error, but the following log message appeared. I have 3 GPUs, and when running the Docker container I set --gpus all.

INFO utils.py:237] Using provided selected GPUs list [0, 1]

However, if I run the following command, then the selected GPUs ([0, 2]) do show as active in the terminal. This looks like a bug in NIM to me.

docker run -it --rm \
    --gpus '"device=0,2"' \
    --shm-size=32GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3.1-8b-instruct:latest

Additionally, regarding the following Docker image:

docker run -it --rm \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3-8b-instruct:latest

It uses vLLM as the backend engine, but there is no tag (AFAIK) that uses TensorRT-LLM.

To ensure that the local GPUs are in good shape, I ran the following parallel-processing test and it works as expected. So the fact that running the container with --gpus all leaves gpu:2 inactive, as shown above, seems like a NIM bug to me.

import torch
from concurrent.futures import ThreadPoolExecutor
from torchvision.models import efficientnet_v2_l

def process_sample_on_gpu(gpu_id, sample):
    torch.cuda.set_device(gpu_id)
    model = get_heavy_model()  
    model = model.to(f"cuda:{gpu_id}")
    model.eval() 
    sample = sample.to(f"cuda:{gpu_id}")
    
    with torch.no_grad():
        output = model(sample)
    
    print(f"Processed sample on GPU {gpu_id} with output shape: {output.shape}")
    return output

def get_heavy_model():
    # Randomly initialized weights are fine for a GPU load test.
    model = efficientnet_v2_l(weights=None)
    return model

def generate_large_sample(batch_size=20, channels=3, height=1024*2, width=1024*2):
    return torch.randn(batch_size, channels, height, width)

samples = [generate_large_sample() for _ in range(3)]
gpu_ids = [0, 1, 2]

with ThreadPoolExecutor(max_workers=len(gpu_ids)) as executor:
    futures = [
        executor.submit(process_sample_on_gpu, gpu_id, sample)
        for gpu_id, sample in zip(gpu_ids, samples)
    ]
    results = [future.result() for future in futures]

print("All samples processed.")

Also, while running the following command

docker run -it --rm \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3-8b-instruct:latest

Does it prepare for data-parallel or model-parallel computation? The following log message mentions Profile metadata: tp: 2. Does tp refer to tensor processing, meaning the layers of the model are split across the detected GPUs? If so, how can I run this Docker image with data parallelism?

INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: tp: 2

Full log message

docker run -it --rm \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-26 07:39:03,788 [INFO] PyTorch version 2.2.2 available.
2025-01-26 07:39:04,403 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-26 07:39:04,403 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2025-01-26 07:39:04,458 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-26 07:39:05.15 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-26 07:39:05.17 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-26 07:39:05.17 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-26 07:39:05.17 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-26 07:39:05.17 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-26 07:39:05.17 ngc_injector.py:142] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-26 07:39:05.22 ngc_injector.py:147] Profile metadata: tp: 2
INFO 01-26 07:39:05.22 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-26 07:39:05.23 ngc_injector.py:173] Model workspace is now ready. It took 0.001 seconds
2025-01-26 07:39:06,909 INFO worker.py:1749 -- Started a local Ray instance.
INFO 01-26 07:39:07.954 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-yba7l2iw', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-yba7l2iw', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-26 07:39:08.133 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:10.963 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:10 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-26 07:39:11 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:11 selector.py:28] Using FlashAttention backend.
INFO 01-26 07:39:12 pynccl_utils.py:43] vLLM is using nccl==2.19.3
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:12 pynccl_utils.py:43] vLLM is using nccl==2.19.3
INFO 01-26 07:39:12.187 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 01-26 07:39:12 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:12 utils.py:130] reading GPU P2P access cache from /opt/nim/.cache/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
(RayWorkerWrapper pid=7440) WARNING 01-26 07:39:12 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-26 07:39:18 model_runner.py:173] Loading model weights took 7.4829 GB
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:18 model_runner.py:173] Loading model weights took 7.4829 GB
INFO 01-26 07:39:19 ray_gpu_executor.py:217] # GPU blocks: 33915, # CPU blocks: 4096
INFO 01-26 07:39:21 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-26 07:39:21 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:21 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:21 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=7440) INFO 01-26 07:39:26 model_runner.py:1054] Graph capturing finished in 6 secs.
INFO 01-26 07:39:26 model_runner.py:1054] Graph capturing finished in 6 secs.
WARNING 01-26 07:39:26.880 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:26.887 serving_chat.py:347] Using default chat template:
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
WARNING 01-26 07:39:27.40 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-26 07:39:27.48 api_server.py:456] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
INFO 01-26 07:39:27.48 api_server.py:460] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

INFO 01-26 07:39:27.80 server.py:82] Started server process [31]
INFO 01-26 07:39:27.80 on.py:48] Waiting for application startup.
INFO 01-26 07:39:27.85 on.py:62] Application startup complete.
INFO 01-26 07:39:27.87 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Additionally, if I check the model profiles for the llama-3-8B NIM, I get the following profiles, where some LoRA profiles are listed; but I have RTX A6000 Ada GPUs. Does that mean no supported LoRA checkpoints are available?

I have no name!@4f5131e1fc07:/$ list-model-profiles
SYSTEM INFO
- Free GPUs:
  -  [26b1:10de] (0) NVIDIA RTX 6000 Ada Generation (RTX A6000 Ada) [current utilization: 2%]
  -  [26b1:10de] (1) NVIDIA RTX 6000 Ada Generation (RTX A6000 Ada) [current utilization: 1%]
  -  [26b1:10de] (2) NVIDIA RTX 6000 Ada Generation (RTX A6000 Ada) [current utilization: 1%]
MODEL PROFILES
- Compatible with system and runnable:
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
- Incompatible with system:
  - dcd85d5e877e954f26c4a7248cd3b98c489fbde5f1cf68b4af11d665fa55778e (tensorrt_llm-h100-fp8-tp2-latency)
  - f59d52b0715ee1ecf01e6759dea23655b93ed26b12e57126d9ec43b397ea2b87 (tensorrt_llm-l40s-fp8-tp2-latency)
  - 30b562864b5b1e3b236f7b6d6a0998efbed491e4917323d04590f715aa9897dc (tensorrt_llm-h100-fp8-tp1-throughput)
  - 09e2f8e68f78ce94bf79d15b40a21333cea5d09dbe01ede63f6c957f4fcfab7b (tensorrt_llm-l40s-fp8-tp1-throughput)
  - a93a1a6b72643f2b2ee5e80ef25904f4d3f942a87f8d32da9e617eeccfaae04c (tensorrt_llm-a100-fp16-tp2-latency)
  - e0f4a47844733eb57f9f9c3566432acb8d20482a1d06ec1c0d71ece448e21086 (tensorrt_llm-a10g-fp16-tp2-latency)
  - 879b05541189ce8f6323656b25b7dff1930faca2abe552431848e62b7e767080 (tensorrt_llm-h100-fp16-tp2-latency)
  - 24199f79a562b187c52e644489177b6a4eae0c9fdad6f7d0a8cb3677f5b1bc89 (tensorrt_llm-l40s-fp16-tp2-latency)
  - 751382df4272eafc83f541f364d61b35aed9cce8c7b0c869269cea5a366cd08c (tensorrt_llm-a100-fp16-tp1-throughput)
  - c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)
  - cb52cbc73a6a71392094380f920a3548f27c5fcc9dab02a98dc1bcb3be9cf8d1 (tensorrt_llm-h100-fp16-tp1-throughput)
  - d8dd8af82e0035d7ca50b994d85a3740dbd84ddb4ed330e30c509e041ba79f80 (tensorrt_llm-l40s-fp16-tp1-throughput)
  - 9137f4d51dadb93c6b5864a19fd7c035bf0b718f3e15ae9474233ebd6468c359 (tensorrt_llm-a10g-fp16-tp2-throughput-lora)
  - cce57ae50c3af15625c1668d5ac4ccbe82f40fa2e8379cc7b842cc6c976fd334 (tensorrt_llm-a100-fp16-tp1-throughput-lora)
  - 3bdf6456ff21c19d5c7cc37010790448a4be613a1fd12916655dfab5a0dd9b8e (tensorrt_llm-h100-fp16-tp1-throughput-lora)
  - 388140213ee9615e643bda09d85082a21f51622c07bde3d0811d7c6998873a0b (tensorrt_llm-l40s-fp16-tp1-throughput-lora)
I have no name!@4f5131e1fc07:/$

And if I run download-to-cache with no profiles specified, letting it decide the most optimal profile to download for the given hardware:

INFO 01-26 08:05:16.768 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-26 08:05:16.768 ngc_profile.py:220] Detected 2 compatible profile(s).
INFO 01-26 08:05:16.768 ngc_injector.py:107] Valid profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2) on GPUs [0, 1, 2]
INFO 01-26 08:05:16.768 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0, 1, 2]
INFO 01-26 08:05:16.768 pre_download.py:31] Selected profile: 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
INFO 01-26 08:05:16.768 pre_download.py:65] Downloading contents for profile 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f
INFO 01-26 08:05:16.768 pre_download.py:71] {
  "feat_lora": "false",
  "llm_engine": "vllm",
  "precision": "fp16",
  "tp": "2"
}

To set a specific profile, the steps in Model Profiles — NVIDIA NIM for Large Language Models (LLMs) can be followed:

docker run -it --rm \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_MODEL_PROFILE=ID \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest

Is there an option in the client API to assign a model to a specific GPU? Additionally, what is the recommended approach for batch inference? For example, if I have 8 GPUs and want to send one sample to each GPU (a total of 8 samples), I would like the client API to load the model on each GPU and perform inference simultaneously. I could do this in the usual way using PyTorch (reference: link), but after adopting the Python multiprocessing approach, I’m not sure if it works as expected. Could you please confirm this or point me to the appropriate documentation for further details?

import time

import pandas as pd
import torch
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
from prompt import prompt_template  # user-defined prompt builder

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

def infer_with_device(prompt, model, max_tokens, gpu_id):
    start_time = time.time()
    print(f"GPU {gpu_id}: Processing...")
    torch.cuda.set_device(gpu_id)
    response = client.chat.completions.create(
        model=model,
        messages=[prompt],
        max_tokens=max_tokens,
        stream=False
    )
    result = response.choices[0].message.content
    inference_time = time.time() - start_time
    print(f"GPU {gpu_id} in {inference_time:.2f} seconds.")
    return result


def process_prompts_with_gpus(file_list, model, max_tokens, gpu_ids):
    dataframes = [pd.read_csv(i) for i in file_list]
    prompts = [prompt_template(i) for i in dataframes]
    num_gpus = len(gpu_ids)
    total_prompts = len(prompts)
    print(f"Distributing {total_prompts} prompts among {num_gpus} GPUs.")

    results = [None] * total_prompts 
    with ThreadPoolExecutor(max_workers=num_gpus) as executor:
        futures = []
        for i, prompt in enumerate(prompts):
            gpu_id = gpu_ids[i % num_gpus]
            future = executor.submit(infer_with_device, prompt, model, max_tokens, gpu_id)
            futures.append((i, future))

        for i, future in futures:
            results[i] = future.result()

    return results


# len: 50
myd = [
    '67.csv',
    '47.csv',
    '88.csv',
    ...
]

def main(model, gpu_ids):
    max_tokens = 2048
    start_time = time.time()
    results = process_prompts_with_gpus(myd, model, max_tokens, gpu_ids)
    total_time = time.time() - start_time
    print(f"\nTotal execution time: {total_time:.2f} seconds.")
    return results

gpu_ids = [0, 1, 2] 
# gpu_ids = [0, 1] 
# gpu_ids = [0]  
model = "meta/llama3-8b-instruct"
final_dataframes = main(model, gpu_ids)

FYI, while running the Docker container, I used the vllm-fp16-tp1 profile.

- Free GPUs:
  - [26b1:10de] (0) NVIDIA RTX 6000 Ada Generation [current utilization: 2%]
  - [26b1:10de] (1) NVIDIA RTX 6000 Ada Generation [current utilization: 1%]
  - [26b1:10de] (2) NVIDIA RTX 6000 Ada Generation [current utilization: 1%]
MODEL PROFILES
- Compatible with system and runnable:
  - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
  - 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
  - With LoRA support:
    - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)

Hi @mohammed.innat – Let me try to address as much as I can here.

Could you please elaborate on this? What is the best and safest way to send multiple requests to NIM?

What you should do in this instance is send your prompts from the client to the NIM server in an asynchronous loop, one at a time. The NIM server will automatically determine how to batch the requests together for processing. There’s no need to assign a cuda context or set up any distributed process groups on the client side.
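
As a rough sketch of that pattern, using the async client from the openai package against the same local endpoint (the model name, prompts, and max_tokens are placeholders):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

async def infer_one(prompt, model="meta/llama3-8b-instruct", max_tokens=256):
    # One prompt per request; NIM's scheduler decides how to batch concurrent requests.
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

async def infer_all(prompts):
    # Send all requests concurrently from the client side.
    return await asyncio.gather(*(infer_one(p) for p in prompts))

results = asyncio.run(infer_all(["what is tensorflow?", "what is torch?", "what is jax?"]))
print(results)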

Say I have 5 GPUs and I want to leverage all of them during inference. The following is my current setup, but since NIM doesn’t support batch processing (Hugging Face does), I think this implementation has a limitation for batch processing.

I’m a little confused by this code snippet. If you are sending the prompts from a client to a server, there is no benefit to running the client code with different CUDA contexts. All of the GPU processing is done by the server application. In the example you linked, there is no client-server split – all of the processing is done by a single script and therefore it’s helpful to split the work across multiple GPUs.

Also, while running it with the following command, how do I enable LoRA or some other config?

Take a look at the docs for enabling LoRA here: Parameter-Efficient Fine-Tuning — NVIDIA NIM for Large Language Models (LLMs)
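
Once an adapter has been loaded following those docs, my understanding is that it is addressed through the same OpenAI-compatible API by passing the adapter's name in the model field. A minimal sketch, where llama3-8b-instruct-my-lora is a hypothetical adapter name (check /v1/models for the names your NIM actually exposes):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# List the models/adapters the server currently exposes.
for m in client.models.list().data:
    print(m.id)

# "llama3-8b-instruct-my-lora" is a placeholder adapter name.
response = client.chat.completions.create(
    model="llama3-8b-instruct-my-lora",
    messages=[{"role": "user", "content": "what is keras 3?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)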

I tried meta/llama-3.1-8b-instruct, which uses the TensorRT-LLM engine as its backend. With it I was able to run without any out-of-memory error, but the following log message appeared. I have 3 GPUs, and when running the Docker container I set --gpus all.

It uses vLLM as the backend engine, but there is no tag (AFAIK) that uses TensorRT-LLM.

NIM automatically determines what backend engine to use and how many GPUs to execute on based on the model profile logic explained here: Model Profiles — NVIDIA NIM for Large Language Models (LLMs). If you want to change the backend or execution profile, there’s instructions on that page for how to do so. In particular, NIM won’t necessarily use all of the GPUs assigned to the container if there’s a different execution profile available.

Does it prepare for data-parallel or model-parallel computation? The following log message mentions Profile metadata: tp: 2. Does tp refer to tensor processing, meaning the layers of the model are split across the detected GPUs? If so, how can I run this Docker image with data parallelism?

TP refers to Tensor Parallel. You can see a conceptual overview of Tensor Parallelism here: Parallelisms — NVIDIA NeMo Framework User Guide. Data Parallelism doesn’t really apply when performing inference, since there is no “all-reduce” type operation across instances – it’s equivalent to running multiple independent containers on separate GPUs. So you can achieve this by launching multiple NIM containers and distributing the requests across them.
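
A minimal client-side sketch of that setup, assuming three single-GPU NIM containers are already listening on ports 8000-8002 (the ports, model name, and prompts are placeholders):

import asyncio
from openai import AsyncOpenAI

# One client per NIM container, each container pinned to a different GPU at launch.
endpoints = ["http://0.0.0.0:8000/v1", "http://0.0.0.0:8001/v1", "http://0.0.0.0:8002/v1"]
clients = [AsyncOpenAI(base_url=url, api_key="not-used") for url in endpoints]

async def infer_one(client, prompt, model="meta/llama3-8b-instruct", max_tokens=256):
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content

async def infer_all(prompts):
    # Round-robin the prompts across the containers; each instance batches its own share.
    tasks = [infer_one(clients[i % len(clients)], p) for i, p in enumerate(prompts)]
    return await asyncio.gather(*tasks)

results = asyncio.run(infer_all(["what is tensorflow?", "what is torch?", "what is jax?"]))
print(results)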

Additionally, if I check the model profiles for the llama-3-8B NIM, I get the following profiles, where some LoRA profiles are listed; but I have RTX A6000 Ada GPUs. Does that mean no supported LoRA checkpoints are available?

The following profiles in the snippet you shared support LoRA:

  - With LoRA support:
    - c5ffce8f82de1ce607df62a4b983e29347908fb9274a0b7a24537d6ff8390eb9 (vllm-fp16-tp2-lora)
    - 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)

Is there an option in the client API to assign a model to a specific GPU?

No – the idea here is that NIM determines which GPUs to run the model on server-side, and it is not something that needs to be specified on the client side.

Additionally, what is the recommended approach for batch inference?

The recommended approach is to send the requests individually and asynchronously. The server has a batching scheduler that determines how to batch the requests server-side.

Thank you for the detailed response! Based on your input, I’ve updated my setup accordingly. My expectation is that during the inference process, I’ll be able to monitor all the available GPUs actively working and observe a noticeable speedup as more GPUs are utilized.

Starting the Docker container:

# model-profile: tp=1
docker run -it --rm \
    --gpus all \
    --shm-size=32GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_MODEL_PROFILE=8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest
# run.py
import asyncio
import os
import pandas as pd
from openai import AsyncOpenAI
from prompt import prompt_template
import logging


logging.basicConfig(
    filename="batch_inference.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logging.info("Starting inference script...")


client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

async def infer_async(prompt, model, max_tokens):
    # AsyncOpenAI lets the requests run concurrently; the synchronous client
    # would block the event loop and effectively serialize them.
    response = await client.chat.completions.create(
        model=model,
        messages=[prompt],
        max_tokens=max_tokens,
        stream=False,
    )
    return response.choices[0].message.content


async def process_prompts_async(prompts, model, max_tokens):
    tasks = [infer_async(prompt, model, max_tokens) for prompt in prompts]
    results = await asyncio.gather(*tasks)
    return results


def split_data(file_list, num_splits):
    return [file_list[i::num_splits] for i in range(num_splits)]


async def main(model, max_tokens, gpu_id, prompts):
    print(f"GPU {gpu_id}: Processing {len(prompts)} prompts.")
    results = await process_prompts_async(prompts, model, max_tokens)
    print(f"GPU {gpu_id}: Completed processing.")
    return results


if __name__ == "__main__":
    model = "meta/llama3-8b-instruct"
    max_tokens = 2048
    num_gpus = 3  # Number of GPUs available
    root = os.path.join(os.getcwd(), "sample_files")
    file_list = [os.path.join(root, i) for i in os.listdir(root)]

    # Load data and prepare prompts
    dataframes = [pd.read_excel(file) for file in file_list]
    prompts = [prompt_template(df[:-1]) for df in dataframes]

    # Split prompts into subsets for each GPU
    prompt_splits = split_data(prompts, num_gpus)

    # Run asynchronous inference for each subset concurrently
    async def run_all():
        tasks = [
            main(model, max_tokens, gpu_id, prompt_splits[gpu_id])
            for gpu_id in range(num_gpus)
        ]
        return await asyncio.gather(*tasks)

    all_results = asyncio.run(run_all())

    # Flatten results from all GPUs
    results = [result for gpu_results in all_results for result in gpu_results]

    print(f"Total prompts processed: {len(results)}")

run: python run.py

GPU 0: Processing 34 prompts.
GPU 1: Processing 33 prompts.
GPU 2: Processing 33 prompts.
GPU 0: Completed processing.
GPU 1: Completed processing.
GPU 2: Completed processing.
Total prompts processed: 100

Using nvidia-smi, I can see that only the first GPU is active and the rest are idle. Is this expected?

I ended up running multiple containers to achieve this approach.

Running Docker as:

docker run -it --gpus '"device=0"' -p 8000:8000.. 
docker run -it --gpus '"device=1"' -p 8001:8000... 
docker run -it --gpus '"device=2"' -p 8002:8000... 

However, I’d like to confirm one thing: when running a self-hosted NIM on a single-node, multi-GPU machine with the model profile set to tp=1 and the --gpus all flag, only one GPU becomes active while the others remain idle. Consequently, during inference with NIM only the active GPU is utilized, leaving the remaining GPUs unused, and there is no explicit control to make use of the other GPUs.