High libcuda CPU usage preventing full GPU usage

Hello,

I have Llama-3-8B-Instruct running on an L4 GPU (GCP VM). During inference, I see GPU usage of around 50%. Digging a little further, I noticed that one CPU core is at 100% throughout the inference, so I am guessing that this is a bottleneck preventing full usage of the GPU. CPU profiling shows that most of this CPU usage is in libcuda. So I am wondering whether this is normal or whether something is wrong with my environment that leads to this behavior.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

Below is my code, if it helps:
    self.pipeline = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        model_kwargs={"torch_dtype": torch.bfloat16},
        device="cuda",
    )
    prompt = self.pipeline.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    terminators = [
        self.pipeline.tokenizer.eos_token_id,
        self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    outputs = self.pipeline(
        prompt,
        max_new_tokens=max_length,
        eos_token_id=terminators,
        do_sample=False,
        temperature=0.0,
        top_p=0.9,
    )

Sanity check 1: Except for your Llama run, no other processes or other users on this system are utilizing the GPU, correct?

Sanity check 2: I assume “GCP VM” is some sort of virtualization layer, although I do not recognize “GCP”. If so, virtualization adds CPU overhead; try running on bare metal.

That is a question for the Llama people: I would suggest using whatever support infrastructure they provide via forums, mailing lists, GitHub etc. It may be a simple issue of configuration settings or the particulars of the workload being used here that prevent the GPU from being utilized fully.

A common mistake when building GPU-accelerated systems is to use a CPU that is too slow. As GPU performance has grown quickly over the past decade, it is now possible (and has been observed in real life) that the largely serial CPU portion of the code becomes a bottleneck. This includes CUDA driver overhead, e.g. for handling memory allocations. My standing recommendation is to use CPUs with > 3.5 GHz.
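To make the bottleneck argument concrete, here is a rough back-of-the-envelope model in Python. It assumes (a simplification, not something measured in this thread) that the CPU thread issuing work and the GPU executing it overlap fully, so each step takes as long as the slower of the two. The function name and the millisecond figures are purely illustrative.

```python
def estimated_gpu_utilization(cpu_ms_per_step: float, gpu_ms_per_step: float) -> float:
    """Fraction of wall-clock time the GPU is busy under a simple overlap model.

    Assumption: CPU-side work (framework plus driver overhead) and GPU
    execution overlap perfectly, so one step takes max(cpu, gpu) milliseconds.
    """
    if cpu_ms_per_step < 0 or gpu_ms_per_step <= 0:
        raise ValueError("CPU time must be non-negative and GPU time must be positive")
    step_ms = max(cpu_ms_per_step, gpu_ms_per_step)
    return gpu_ms_per_step / step_ms


# Hypothetical numbers: the CPU needs twice as long to issue a step
# as the GPU needs to execute it.
print(f"{estimated_gpu_utilization(2.0, 1.0):.0%}")

# Hypothetical faster CPU that keeps ahead of the GPU.
print(f"{estimated_gpu_utilization(0.8, 1.0):.0%}")
```

Under this model, a serial CPU portion that takes twice as long as the GPU work leaves the GPU idle about half the time, which is the kind of picture you get when one core is pegged and the GPU sits near 50%.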

The second common mistake when configuring GPU-accelerated systems is to provide too little system memory. System memory should be 2x to 4x the total GPU memory (skewed towards the latter factor where performance matters).

Sanity check 1: Except for your Llama run, no other processes or other users on this system are utilizing the GPU, correct?

Yes, nothing else is running on the VM. GPU usage hovers around 50% (when checked via nvidia-smi) and CPU usage of 1 core is at 100%.
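For anyone who wants to confirm the "one core at 100%" observation without a profiler, here is a minimal sketch that computes per-core utilization from two snapshots of `/proc/stat` on Linux. The helper names are my own, and it only looks at the standard per-core counters; it is a quick check under those assumptions, not a replacement for proper profiling.

```python
import time


def parse_proc_stat(text: str) -> dict:
    """Return {core_name: (busy_jiffies, total_jiffies)} for each 'cpuN' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines such as 'cpu0'; skip the aggregate 'cpu' line
        # and unrelated lines such as 'intr' or 'ctxt'.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Counters: user nice system idle iowait irq softirq steal (guest
        # time is already included in user, so it is not added again).
        total = sum(values[:8])
        idle = sum(values[3:5])  # idle + iowait
        cores[fields[0]] = (total - idle, total)
    return cores


def per_core_utilization(before: str, after: str) -> dict:
    """Busy fraction per core between two /proc/stat snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    result = {}
    for core, (busy2, total2) in second.items():
        busy1, total1 = first.get(core, (0, 0))
        delta_total = total2 - total1
        result[core] = (busy2 - busy1) / delta_total if delta_total > 0 else 0.0
    return result


def sample_per_core(interval_s: float = 1.0) -> dict:
    """Take two live snapshots of /proc/stat (Linux only) and compare them."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        after = f.read()
    return per_core_utilization(before, after)
```

Calling `sample_per_core()` from a second terminal while inference is running should show one core close to 1.0 and the rest near 0.0 if the workload is limited by a single CPU thread.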

Sanity check 2: I assume “GCP VM” is some sort of virtualization layer, although I do not recognize “GCP”. If so, virtualization adds CPU overhead; try running on bare metal.

GCP stands for Google Cloud Platform.

I have also posted on the Llama forum. At this point I am not sure where the optimization needs to happen, hence I posted here as well. Google Cloud L4 GPUs come with a fixed host VM configuration, so I don’t really have an option to tweak the CPU/memory configuration. But I can at least see that all other CPU cores are idle and that system RAM usage is almost negligible once the model has been loaded onto the GPU.

Does the Google Cloud Platform instance type you chose give you access to 100% of the GPU? You may want to mention the instance you are using to provide additional context. While I am not familiar with cloud provider instances, it may provide clues to other people who do have that expertise.

Yes, we get access to the full GPU. It is not shared. From the looks of it, the CPU is the bottleneck preventing full GPU usage.
g2-standard-4 is the VM type.

NVIDIA L4 spec says it sports 24 GB of memory, Ada Lovelace architecture, released in 2023.

Google specifications for the g2-standard-4 instance state 16 GB of system memory; CPU with 2.2 GHz / 2.9 GHz base/boost clock, Cascade Lake architecture (which dates to 2019).

This does not seem like a well-balanced GPU-accelerated system by the heuristics I outlined earlier. Whether this is the root cause for your observations I cannot say. It is quite likely a contributing factor.
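As a quick illustration, the two rules of thumb from earlier in the thread (CPU clock above 3.5 GHz, system memory at least 2x the GPU memory) can be applied to the figures quoted for this instance. The function below is a hypothetical helper written for this post, and it encodes only those heuristics, not any official sizing guidance.

```python
def check_balance(cpu_ghz: float, system_mem_gb: float, gpu_mem_gb: float) -> list:
    """Return a list of findings where the system misses the rules of thumb."""
    findings = []
    # Heuristic 1: CPU clock above 3.5 GHz.
    if cpu_ghz <= 3.5:
        findings.append(f"CPU clock {cpu_ghz} GHz is not above 3.5 GHz")
    # Heuristic 2: system memory should be 2x to 4x the total GPU memory.
    ratio = system_mem_gb / gpu_mem_gb
    if ratio < 2.0:
        findings.append(
            f"system memory is {ratio:.2f}x GPU memory, below the 2x to 4x range"
        )
    return findings


# g2-standard-4 with one L4, using the figures quoted above. The boost clock
# (2.9 GHz) is used to give the CPU the benefit of the doubt.
for finding in check_balance(cpu_ghz=2.9, system_mem_gb=16, gpu_mem_gb=24):
    print(finding)
```

With these inputs both checks fail: even the boost clock is below 3.5 GHz, and 16 GB of system memory is roughly 0.67x the 24 GB on the GPU.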