We are currently running Triton on EKS in AWS, scheduled on a g4dn.xlarge node (Tesla T4). During a couple of load tests with concurrent requests, we noticed that the GPU is not fully utilized: we see an average of 80% GPU utilization and 200% CPU utilization, whereas we expected 100% GPU utilization. We are running two models on our Triton server, both described below:
Yolo v5 Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] }
]
output [
  { name: "ensemble_detections" data_type: TYPE_FP16 dims: [ 1, 25200, 226 ] }
]
SSCD Large Ensemble:
input [
  { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] },
  { name: "ensemble_bounding_boxes" data_type: TYPE_UINT16 dims: [ 16, 4 ] }
]
output [
  { name: "ensemble_embeddings" data_type: TYPE_FP32 dims: [ 16, 1024 ] }
]

Attached are the script we used to issue the concurrent requests and an image of what we are seeing on our side.



# Set the number of concurrent requests you want to send


# Define the curl command in a function for clarity
perform_curl_request() {
    # Add your API call here
    :  # no-op placeholder so the function body still parses until the call is filled in
}

# Export the function so the bash subshells started by xargs can see it
export -f perform_curl_request

# Loop over the desired range of concurrent requests
for CONCURRENT_REQUESTS in {1..25}; do
    # Wall-clock start time in milliseconds
    START=$(python3 -c 'import time; print(int(time.time() * 1000))')
    # Use xargs to run the function concurrently
    seq "$CONCURRENT_REQUESTS" | xargs -I{} -P"$CONCURRENT_REQUESTS" bash -c 'perform_curl_request'

    # Wall-clock end time in milliseconds
    END=$(python3 -c 'import time; print(int(time.time() * 1000))')

    # Elapsed time in ms, and average seconds per request at this concurrency
    DELTA=$(echo "scale=3; $END - $START" | bc)
    RATIO=$(echo "scale=3; $DELTA / (1000 * $CONCURRENT_REQUESTS)" | bc)

    echo "$CONCURRENT_REQUESTS concurrent request(s) done in $DELTA ms, Time ratio: $RATIO"
done
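For reading the numbers: the "Time ratio" printed above is the average wall-clock seconds per request at a given concurrency, so its reciprocal is throughput in requests per second. Below is a minimal sketch of that conversion. The helper name `throughput_rps` is made up for illustration and is not part of our script, and the figures in the usage line are not measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): convert a request count
# and an elapsed time in milliseconds into requests per second.
throughput_rps() {
    local requests=$1 delta_ms=$2
    awk -v n="$requests" -v ms="$delta_ms" \
        'BEGIN { if (ms > 0) printf "%.2f\n", n * 1000 / ms }'
}

# Illustrative numbers only: 25 requests finishing in 5000 ms.
throughput_rps 25 5000   # prints 5.00
```

If this number stops growing as CONCURRENT_REQUESTS increases, the server is saturated somewhere, even if the GPU itself reports less than 100% utilization.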

Is there a way to increase GPU utilization and increase throughput on Triton?


TensorRT Version:
GPU Type: Tesla T4
Nvidia Driver Version: 470.182.03
CUDA Version: V12.2.128
CUDNN Version: 8.9.5
Operating System + Version: Ubuntu 22.04.3 LTS
Python Version (if applicable): python3
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, tag: 2.38.0

Steps To Reproduce

  1. Fill in the above function called perform_curl_request() to host your desired api call to hit Triton
  2. Place code in a bash script
  3. Run bash script → bash
  4. To monitor CPU and GPU utilization, exec into the container and run:
    `# monitoring gpu memory and utilization
    nvidia-smi --query-gpu=timestamp,,,memory.used,utilization.memory,temperature.gpu,utilization.gpu --format=csv --loop-ms=100

    # monitoring triton cpu usage
    top -d 0.100 -c -p $(pgrep -d',' -f triton)`
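
To get a single average out of the nvidia-smi CSV log from step 4, something like the following could be used. It is only a sketch: the function names `avg_gpu_util` and `sample_csv` are hypothetical, the sample rows are invented for illustration (not our measurements), and it assumes the CSV header contains a `utilization.gpu` column whose values look like `80 %`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average the "utilization.gpu [%]" column of
# `nvidia-smi --format=csv` output read from stdin.
avg_gpu_util() {
    awk -F', *' '
        NR == 1 {
            # Locate the utilization.gpu column by its header name.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^utilization\.gpu/) col = i
            next
        }
        col {
            # Values look like "80 %"; adding 0 keeps only the numeric prefix.
            sum += $col + 0
            n++
        }
        END {
            if (n) printf "%.1f\n", sum / n
        }
    '
}

# Invented sample rows, for illustration only.
sample_csv() {
    cat <<'EOF'
timestamp, memory.used [MiB], utilization.gpu [%]
2024/01/01 00:00:00.000, 1024 MiB, 70 %
2024/01/01 00:00:00.100, 1024 MiB, 90 %
EOF
}

sample_csv | avg_gpu_util   # prints 80.0
```

In practice the nvidia-smi output would be redirected to a file during the load test and then piped through `avg_gpu_util` afterwards.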

Hi @evan53 ,
Triton Forum should be able to assist you here.
Moving it to the Forum.

Hi @evan53 ,
Apologies for the miscommunication.
The forum is currently not active; we request that you raise your concern on Issues · triton-inference-server/server · GitHub.

Thank you.

