RAM usage doesn't go back down after ASR requests. Could it be because of Triton's response cache? If so, why don't the metrics reflect that?

Hardware - GPU (T4)
Operating System - Container-Optimized OS with containerd (cos_containerd) (on a k8s cluster)
Riva Version 2.8.1

I’m experimenting with a deployment on a K8s cluster, using Helm charts. I’m only using a subset of the models, and once they’re loaded and Triton starts, RAM (not VRAM) usage sits at around 5.8 GB.

After I send requests for offline ASR, RAM usage spikes, drops, then plateaus at about 8.8 GB and never returns to 5.8 GB.

The following image shows the Triton metric for RAM consumption, nv_cpu_memory_used_bytes: the spike, then the plateau at 8.8 GB.

  1. Is that Triton caching the results locally in RAM?
  2. If so, why are all the cache-related metrics zero? Both nv_cache_num_hits_per_model and nv_cache_num_misses_per_model stay at zero, even when I send the same file for transcription.
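
For anyone who wants to reproduce the check, here is a minimal sketch of how the three metrics named above could be read side by side from a Prometheus-format metrics dump. The metric names come from this post; the sample text, the helper names (parse_metrics, summarize, fetch_metrics) and the URL are assumptions for illustration. Port 8002 is Triton's default metrics port, but whether it is reachable in a given Riva deployment is not confirmed here. If both cache counters stay at zero across repeated identical requests, one possible reading is that the response cache is not enabled for those models, which would point away from caching as the cause of the plateau. That is a hypothesis to verify, not a confirmed diagnosis.

```python
import re
import urllib.request

# Made-up sample in Prometheus text format, mirroring the values described
# in this post (about 8.8 GiB used, cache counters at zero).
SAMPLE = """\
# TYPE nv_cpu_memory_used_bytes gauge
nv_cpu_memory_used_bytes 9448928051
nv_cache_num_hits_per_model{model="asr",version="1"} 0
nv_cache_num_misses_per_model{model="asr",version="1"} 0
"""

LINE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_metrics(text):
    """Return {metric_name: [(labels, value), ...]} from Prometheus text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        samples.setdefault(name, []).append((labels or "", float(value)))
    return samples


def summarize(samples):
    """Sum each metric of interest across all of its label sets."""
    def total(name):
        return sum(value for _, value in samples.get(name, []))

    return {
        "cpu_used_gib": total("nv_cpu_memory_used_bytes") / 2**30,
        "cache_hits": total("nv_cache_num_hits_per_model"),
        "cache_misses": total("nv_cache_num_misses_per_model"),
    }


def fetch_metrics(url="http://localhost:8002/metrics"):
    """Fetch the raw metrics text. The URL is an assumption; adjust it."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(summarize(parse_metrics(SAMPLE)))
```

Swapping SAMPLE for fetch_metrics() before and after sending the same audio file would show whether the hit or miss counters move at all while nv_cpu_memory_used_bytes climbs.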

I have noticed the same thing. riva_server never gets killed. I am calling end() and it never kills the process.