Our service uses Triton Inference Server, and its GPU memory usage grows over time (it looks like a leak).
We run the tritonserver:23.11-py3 image.
Our service process sends requests to the Triton server over gRPC using CUDA shared memory.
cudaMalloc() is called once at initialization, and the same region is reused for sending and receiving requests.
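For context, the allocate-once / reuse pattern looks roughly like this. This is a minimal Python sketch using the tritonclient CUDA shared-memory utilities; the region name, byte size, and server URL are illustrative, not our production values (our actual client is C++):

```python
# Minimal sketch of the allocate-once pattern using tritonclient's
# CUDA shared-memory utilities. Names, sizes, and the URL are
# illustrative placeholders.
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Allocate the CUDA region once (wraps cudaMalloc + cudaIpcGetMemHandle).
INPUT_BYTES = 4 * 1024 * 1024
shm_handle = cudashm.create_shared_memory_region(
    "input_region", INPUT_BYTES, device_id=0)

# Register the region with Triton once; every subsequent request just
# references "input_region" instead of copying tensor bytes over gRPC.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, INPUT_BYTES)
```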
GPU memory (GRAM) climbs sharply during evening hours, when requests are concentrated.
I wanted to resolve this but didn't know where to start.
The Prometheus metric was not helpful either:
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-964c5806-b17b-4615-ba02-453bc6599627"} 21515730944
I checked our C++ code and found that we were not freeing (unregistering) the CUDA shared memory regions properly.
So when our process exits, the shared memory is effectively left owned by the Triton server, which still holds the registered regions.
That is what makes the server's GRAM usage keep increasing.
After fixing our code to clean up the regions, the GRAM leak no longer happens.
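The fix boils down to cleaning up on shutdown: unregister the region from Triton and free the CUDA allocation, so the server does not keep holding the IPC mapping after the client process exits. A minimal Python sketch using the tritonclient utilities (region name and variable names are illustrative; our actual fix is in C++):

```python
# Sketch of the shutdown-path cleanup that stops the leak. The region
# name is an illustrative placeholder; tritonclient's cuda_shared_memory
# utilities are assumed to have created the region.
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

def cleanup(client, shm_handle):
    # Tell Triton to drop its mapping of the region; otherwise the
    # server keeps the CUDA IPC memory alive after our process exits.
    client.unregister_cuda_shared_memory("input_region")
    # Free the cudaMalloc'd memory on the client side (wraps cudaFree).
    cudashm.destroy_shared_memory_region(shm_handle)
```

Running the cleanup in a `finally` block (or an atexit/signal handler) ensures it happens even when the process terminates on an error path.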
Thanks for the support!