GPUs are underutilized with Triton


We are currently running Triton on EKS in AWS, scheduled on a g4dn.xlarge node type (Tesla T4). While running a couple of load tests with concurrent requests, we noticed that our GPU is not fully utilized: we saw an average of 80% GPU utilization and 200% CPU utilization, whereas we expected 100% GPU utilization. We are running two models in our Triton server. Both are described below:
Yolo v5 Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] }
]
output [
  { name: "ensemble_detections" data_type: TYPE_FP16 dims: [ 1, 25200, 226 ] }
]

SSCD Large Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] },
  { name: "ensemble_bounding_boxes" data_type: TYPE_UINT16 dims: [ 16, 4 ] }
]
output [
  { name: "ensemble_embeddings" data_type: TYPE_FP32 dims: [ 16, 1024 ] }
]

Attached is the script we used to send those concurrent requests, along with an image of what we are seeing on our side.



# Define the curl command in a function for clarity
perform_curl_request() {
    # Add your API call here
    :  # no-op placeholder so the empty function body is valid bash
}

# Export the function so the bash -c subshells started by xargs can call it
export -f perform_curl_request

# Loop over the desired range of concurrent requests
for CONCURRENT_REQUESTS in {1..25}; do
    # Wall-clock start time in milliseconds
    START=$(python3 -c 'import time; print(int(time.time() * 1000))')
    # Use xargs to run the function concurrently
    seq "$CONCURRENT_REQUESTS" | xargs -I{} -P"$CONCURRENT_REQUESTS" bash -c 'perform_curl_request'
    # Wall-clock end time in milliseconds
    END=$(python3 -c 'import time; print(int(time.time() * 1000))')

    DELTA=$(echo "scale=3; $END - $START" | bc)
    # Average seconds per request at this concurrency level
    RATIO=$(echo "scale=3; $DELTA / (1000 * $CONCURRENT_REQUESTS)" | bc)

    echo "$CONCURRENT_REQUESTS concurrent request(s) done in $DELTA ms, Time ratio: $RATIO"
done
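
The "Time ratio" the script prints is the average number of seconds per request, so its inverse is the throughput in requests per second. As a minimal sketch, assuming the same inputs the script already has (a request count and an elapsed time in milliseconds), a helper like the one below would report throughput directly. The name throughput_rps and the sample numbers are hypothetical and not part of the original script.

```shell
# Hypothetical helper: requests per second from a request count and elapsed milliseconds.
throughput_rps() {
    local requests=$1 elapsed_ms=$2
    # Guard against a missing or zero duration to avoid dividing by zero.
    if [ -z "$elapsed_ms" ] || [ "$elapsed_ms" -le 0 ]; then
        echo "n/a"
        return 1
    fi
    # requests / (elapsed_ms / 1000), printed with two decimals.
    awk -v n="$requests" -v ms="$elapsed_ms" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

# Made-up example: 25 requests finishing in 5000 ms.
throughput_rps 25 5000   # prints 5.00
```

Inside the loop this could be called as throughput_rps "$CONCURRENT_REQUESTS" "$DELTA" to see at which concurrency level throughput stops increasing.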

Is there a way to increase GPU utilization and increase throughput on Triton?


TensorRT Version:
GPU Type: Tesla T4
Nvidia Driver Version: 470.182.03
CUDA Version: V12.2.128
CUDNN Version: 8.9.5
Operating System + Version: Ubuntu 22.04.3 LTS
Python Version (if applicable): python3
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, tag: 2.38.0

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

  1. Fill in the perform_curl_request() function above with the API call you want to send to Triton
  2. Place the code in a bash script
  3. Run bash script → bash
  4. To monitor CPU and GPU utilization, exec into the container and run:
    `# monitoring gpu memory and utilization
    nvidia-smi --query-gpu=timestamp,,,memory.used,utilization.memory,temperature.gpu,utilization.gpu --format=csv --loop-ms=100

    # monitoring triton cpu usage
    top -d 0.100 -c -p $(pgrep -d',' -f triton)`
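
To turn the nvidia-smi samples into a single average like the 80% figure quoted above, the CSV output can be reduced with awk. This is a minimal sketch, assuming the CSV has a header row containing a "utilization.gpu [%]" column and values formatted like "80 %". The function name avg_gpu_util and the sample rows are hypothetical.

```shell
# Hypothetical helper: mean of the utilization.gpu column in nvidia-smi CSV read from stdin.
avg_gpu_util() {
    awk -F', *' '
        NR == 1 {
            # Locate the utilization.gpu column by its header name.
            for (i = 1; i <= NF; i++) if ($i ~ /^utilization\.gpu/) col = i
            next
        }
        # Strip the trailing " %" and accumulate the numeric value.
        col { v = $col; sub(/ *%.*/, "", v); sum += v; n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}

# Made-up sample rows in the shape nvidia-smi emits with --format=csv.
printf '%s\n' \
    'timestamp, memory.used [MiB], utilization.gpu [%]' \
    '2024/01/01 00:00:00.000, 1024 MiB, 70 %' \
    '2024/01/01 00:00:00.100, 1024 MiB, 90 %' | avg_gpu_util   # prints 80.0
```

In practice the nvidia-smi output from the load test would be redirected to a file and then passed in with avg_gpu_util < gpu.csv, where gpu.csv is a placeholder file name.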

Hi @evan53 ,
Triton Forum should be able to assist you here.
Moving it to the Forum.


Hi @evan53 ,
Apologies for the miscommunication.
The forum is currently not active, so we would request that you raise your concern at Issues · triton-inference-server/server · GitHub.

Thank you.
