Our service uses Triton Inference Server, and its GPU memory usage grows over time (it looks like a leak).
We run the tritonserver:23.11-py3 image.
Our service process sends requests to the Triton server over gRPC using CUDA shared memory.
cudaMalloc() is called once at initialization, and the same region is reused for sending and receiving requests.
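For context, the allocate-once / reuse pattern looks roughly like this. This is a minimal Python sketch using the tritonclient CUDA shared-memory utilities; the region name, byte size, and server URL are illustrative, not our production values (our actual client is C++):

```python
# Minimal sketch of the allocate-once pattern using tritonclient's
# CUDA shared-memory utilities. Names, sizes, and the URL are
# illustrative placeholders.
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Allocate the CUDA region once (wraps cudaMalloc + cudaIpcGetMemHandle).
INPUT_BYTES = 4 * 1024 * 1024
shm_handle = cudashm.create_shared_memory_region(
    "input_region", INPUT_BYTES, device_id=0)

# Register the region with Triton once; every subsequent request just
# references "input_region" instead of copying tensor bytes over gRPC.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, INPUT_BYTES)
```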
GPU memory (GRAM) climbs sharply during evening hours, when requests are concentrated.
I wanted to resolve this but didn't know where to start.
The Prometheus metric was not helpful either:
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-964c5806-b17b-4615-ba02-453bc6599627"} 21515730944
I checked our C++ code and found that we were not freeing (unregistering) the CUDA shared memory regions properly.
So when our process exits, the shared memory is effectively left owned by the Triton server, which still holds the registered regions.
That is what makes the server's GRAM usage keep increasing.
After fixing our code to clean up the regions, the GRAM leak no longer happens.
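The fix boils down to cleaning up on shutdown: unregister the region from Triton and free the CUDA allocation, so the server does not keep holding the IPC mapping after the client process exits. A minimal Python sketch using the tritonclient utilities (region name and variable names are illustrative; our actual fix is in C++):

```python
# Sketch of the shutdown-path cleanup that stops the leak. The region
# name is an illustrative placeholder; tritonclient's cuda_shared_memory
# utilities are assumed to have created the region.
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

def cleanup(client, shm_handle):
    # Tell Triton to drop its mapping of the region; otherwise the
    # server keeps the CUDA IPC memory alive after our process exits.
    client.unregister_cuda_shared_memory("input_region")
    # Free the cudaMalloc'd memory on the client side (wraps cudaFree).
    cudashm.destroy_shared_memory_region(shm_handle)
```

Running the cleanup in a `finally` block (or an atexit/signal handler) ensures it happens even when the process terminates on an error path.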
Thanks for the support!