Description
We are currently running Triton on EKS in AWS, scheduled on a g4dn.xlarge node type (Tesla T4). During a couple of load tests that send requests concurrently, we noticed the GPU is not fully utilized: we see an average of roughly 80% GPU utilization and 200% CPU utilization, where we expected 100% GPU utilization. We are running two models in our Triton server, described below:
Yolo v5 Ensemble:
input [ { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] } ] output [ { name: "ensemble_detections" data_type: TYPE_FP16 dims: [ 1, 25200, 226 ] } ]
SSCD Large Ensemble:
input [ { name: "ensemble_raw_image" data_type: TYPE_UINT8 dims: [ -1 ] }, { name: "ensemble_bounding_boxes" data_type: TYPE_UINT16 dims: [ 16, 4 ] } ] output [ { name: "ensemble_embeddings" data_type: TYPE_FP32 dims: [ 16, 1024 ] } ]
Attached are the script we were using to issue those concurrent requests and an image of what we are seeing on our side.
#!/bin/bash
# The loop below sweeps the number of concurrent requests from 1 to 25
# Define the curl command in a function for clarity
perform_curl_request() {
# Add your API call to Triton here (a hypothetical example is sketched after this script)
}
export -f perform_curl_request
# Loop over the desired range of concurrent requests
for CONCURRENT_REQUESTS in {1..25}; do
START=$(python3 -c 'import time; print(int(time.time() * 1000))')
# Use xargs to run the function concurrently
seq $CONCURRENT_REQUESTS | xargs -I{} -P$CONCURRENT_REQUESTS bash -c 'perform_curl_request'
END=$(python3 -c 'import time; print(int(time.time() * 1000))')
DELTA=$(echo "scale=3; $END - $START" | bc)
# Average wall-clock time per request, in seconds
RATIO=$(echo "scale=3; $DELTA / (1000 * $CONCURRENT_REQUESTS)" | bc)
echo "$CONCURRENT_REQUESTS concurrent request(s) done in $DELTA ms, average per request: ${RATIO}s"
done
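The body of perform_curl_request() is intentionally omitted above. For reference, here is a minimal, hypothetical sketch of what it could contain when targeting Triton's HTTP/REST inference endpoint (KServe v2 protocol); the host localhost:8000, the model name yolo_v5_ensemble, and the tiny inline payload are placeholders rather than our actual call:
# Hypothetical example body for perform_curl_request().
# Assumes Triton's HTTP port (8000 by default) is reachable and that a model
# named "yolo_v5_ensemble" is loaded; replace both with your own values.
perform_curl_request() {
    curl -s -X POST "http://localhost:8000/v2/models/yolo_v5_ensemble/infer" \
        -H "Content-Type: application/json" \
        -d '{"inputs": [{"name": "ensemble_raw_image", "shape": [4], "datatype": "UINT8", "data": [0, 1, 2, 3]}]}' \
        > /dev/null
}
A real request would carry the encoded image bytes (or use Triton's binary-data extension) instead of the four-byte placeholder array.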
Is there a way to increase GPU utilization and increase throughput on Triton?
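For context, the knobs we are aware of are the model-configuration options for multiple execution instances and dynamic batching; a hedged sketch with placeholder values is below, though we are unsure how these interact with ensembles, which is part of the question.
# Hypothetical config.pbtxt fragment for one of the composing models
# (counts and delays are placeholders, not a recommendation)
instance_group [
    {
        count: 2        # run two execution instances of this model on the GPU
        kind: KIND_GPU
    }
]
dynamic_batching {
    max_queue_delay_microseconds: 100   # wait briefly so requests can be batched together
}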
Environment
TensorRT Version: 8.6.1.6-1+cuda12.0
GPU Type: Tesla T4
Nvidia Driver Version: 470.182.03
CUDA Version: V12.2.128
CUDNN Version: 8.9.5
Operating System + Version: Ubuntu 22.04.3 LTS
Python Version (if applicable): python3
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container, tag: 2.38.0
Relevant Files
The load-test script is included in the Description above, along with the input/output configuration of both models.
Steps To Reproduce
- Fill in the perform_curl_request() function above with the API call you want to send to Triton
- Save the code as a bash script
- Run the script with bash
- To monitor CPU and GPU utilization, exec into the container and run:
# monitoring gpu memory and utilization
nvidia-smi --query-gpu=timestamp,memory.total,memory.free,memory.used,utilization.memory,temperature.gpu,utilization.gpu --format=csv --loop-ms=100
# monitoring triton cpu usage
top -d 0.100 -c -p $(pgrep -d',' -f triton)
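Alternatively, assuming the default metrics port 8002 is exposed from the pod, Triton's built-in Prometheus metrics endpoint can be polled instead of exec'ing into the container; a minimal sketch:
# Poll Triton's Prometheus metrics endpoint (default port 8002) once per second
# and print the GPU-utilization and inference-count lines.
while true; do
    curl -s http://localhost:8002/metrics | grep -E 'nv_gpu_utilization|nv_inference_count'
    sleep 1
done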