GPUs are underutilized with Triton

evan53 · November 17, 2023, 4:09am

Description

We are running Triton on EKS in AWS, scheduled on a g4dn.xlarge node type (Tesla T4). During a couple of load tests that issued requests concurrently, we noticed that our GPUs are not fully utilized: we saw an average of 80% GPU utilization and 200% CPU utilization, whereas we expected 100% GPU utilization. We are running two models in our Triton server, both described below:
Yolo v5 Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] }
]
output [
  { name: "ensemble_detections" data_type: TYPE_FP16 dims: [ 1, 25200, 226 ] }
]
SSCD Large Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] },
  { name: "ensemble_bounding_boxes" data_type: TYPE_UINT16 dims: [ 16, 4 ] }
]
output [
  { name: "ensemble_embeddings" data_type: TYPE_FP32 dims: [ 16, 1024 ] }
]

Attached are the script we used to issue those concurrent requests and an image of what we are seeing on our side.

#!/bin/bash

# Set the number of concurrent requests you want to send
CONCURRENT_REQUESTS=1

SECONDS=0

# Define the curl command in a function for clarity
perform_curl_request() {
    # Add your API call here
    :  # no-op placeholder: bash rejects a function body that contains only a comment
}

export -f perform_curl_request

# Loop over the desired range of concurrent requests
for CONCURRENT_REQUESTS in {1..25}; do
    SECONDS=0
    START=$(python3 -c 'import time; print(int(time.time() * 1000))')
    
    # Use xargs to run the function concurrently
    seq $CONCURRENT_REQUESTS | xargs -I{} -P$CONCURRENT_REQUESTS bash -c 'perform_curl_request'

    END=$(python3 -c 'import time; print(int(time.time() * 1000))')

    DELTA=$(echo "scale=3; $END - $START" | bc)
    
    RATIO=$(echo "scale=3; $DELTA / (1000 * $CONCURRENT_REQUESTS)" | bc)

    echo "$CONCURRENT_REQUESTS concurrent request(s) done in $DELTA ms, Time ratio: $RATIO"
done
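
For readers who want to reuse the numbers the loop prints: the "time ratio" is the batch's wall-clock time in seconds divided by the number of concurrent requests, and its inverse is the throughput in requests per second. Below is a minimal sketch of that arithmetic. The helper names `time_ratio` and `throughput` are hypothetical and not part of the original script, and awk's `printf` rounds to three decimals where `bc` with `scale=3` truncates.

```shell
#!/bin/bash
# Hypothetical helpers (not in the original script) that reproduce the
# per-batch arithmetic from the loop above.

# time_ratio DELTA_MS N -> wall-clock seconds per request in the batch
time_ratio() {
    awk -v d="$1" -v n="$2" 'BEGIN { printf "%.3f\n", d / (1000 * n) }'
}

# throughput DELTA_MS N -> completed requests per second for the batch
throughput() {
    awk -v d="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (n * 1000) / d }'
}
```

For example, a batch of 10 concurrent requests that finishes in 5000 ms gives `time_ratio 5000 10` = 0.500 and `throughput 5000 10` = 2.000. If the ratio stops shrinking as concurrency grows, the server is no longer overlapping the additional requests.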

Is there a way to increase GPU utilization and increase throughput on Triton?

Environment

TensorRT Version: 8.6.1.6-1+cuda12.0
GPU Type: Tesla T4
Nvidia Driver Version: 470.182.03
CUDA Version: V12.2.128
CUDNN Version: 8.9.5
Operating System + Version: Ubuntu 22.04.3 LTS
Python Version (if applicable): python3
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, tag: 2.38.0

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Fill in the perform_curl_request() function above with the API call you want to send to Triton.
Place the code in a bash script.
Run the script with bash.
To monitor CPU and GPU utilization, exec into the container and run:
`# monitoring gpu memory and utilization
nvidia-smi --query-gpu=timestamp,memory.total,memory.free,memory.used,utilization.memory,temperature.gpu,utilization.gpu --format=csv --loop-ms=100

# monitoring triton cpu usage
top -d 0.100 -c -p $(pgrep -d',' -f triton)`
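
To turn the nvidia-smi samples into a single average such as the 80% figure quoted above, the CSV output can be saved to a file during the load test and summarized afterwards. This is a minimal sketch, assuming the same `--query-gpu` field order as the command above, so that `utilization.gpu` is the 7th column. The function name `avg_gpu_util` and the file name `gpu.csv` are hypothetical.

```shell
#!/bin/bash
# Sketch: average the utilization.gpu column of a captured nvidia-smi CSV log.
# Assumes the query order used above, so utilization.gpu is the 7th field.
avg_gpu_util() {
    awk -F',' '
        {
            gsub(/[ %]/, "", $7)        # " 80 %" -> "80"
            if ($7 ~ /^[0-9]+$/) {      # skips the header and any blank lines
                sum += $7
                n++
            }
        }
        END { if (n > 0) printf "%.1f\n", sum / n }
    '
}
```

One way to use it: redirect the nvidia-smi command to `gpu.csv` while the load test runs, stop it with Ctrl-C, then run `avg_gpu_util < gpu.csv`. The function prints nothing if the file contains no numeric samples.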

AakankshaS · November 17, 2023, 4:53am

Hi @evan53 ,
Triton Forum should be able to assist you here.
Moving it to the Forum.
Thanks

AakankshaS · November 22, 2023, 1:02pm

Hi @evan53 ,
Apologies for the miscommunication.
The forum is currently not active, so we would request that you raise your concern at Issues · triton-inference-server/server · GitHub

Thank you.
