Static GPU Memory Usage on NIM Server

When I run inference against my self-hosted NIM server using “nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3” as the base image, the GPU memory usage is fairly static (I think it corresponds to the size of the model), even though I send passages of different sizes, while the GPU compute usage fluctuates as shown in the images. I captured the GPU usage with nvtop, and when I check with the nvidia-smi command, the GPU memory usage is the same. There is no issue whatsoever with the service itself; I am just wondering whether this behavior is expected. As far as I know, the input data should be loaded into GPU memory in order to be processed by the GPU. I’m particularly interested in these details so that I can optimize the inference further.

Yes, this is expected behaviour. As sequences are processed through the model, they affect GPU utilization.

There will be a set amount of memory reserved for the model weights (plus any pre-allocated working buffers), and the variation you see shows up in compute utilization as sequences, batches, etc. of different sizes pass through. The per-request input tensors are tiny compared to the model itself, so the memory reading in nvidia-smi barely moves.
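
If you want to observe this yourself, below is a minimal monitoring sketch. It assumes the NIM container is reachable at localhost:8000 and exposes a /v1/ranking endpoint with the payload shape shown (adjust the URL, port, and fields to your deployment); it sends rerank requests with passages of increasing size and samples GPU memory and compute utilization via NVML, so you can see memory stay roughly flat while utilization moves with the input.

```python
import requests
import pynvml

NIM_URL = "http://localhost:8000/v1/ranking"   # assumed endpoint for the reranker NIM
MODEL = "nvidia/nv-rerankqa-mistral-4b-v3"

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_snapshot():
    # Memory in bytes and utilization in percent, as reported by NVML
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    return mem.used / 1024**2, util.gpu

for n_words in (50, 500, 5000):                # passages of increasing size
    passage = "lorem ipsum " * n_words
    payload = {
        "model": MODEL,
        "query": {"text": "What does the passage say?"},
        "passages": [{"text": passage}],
    }
    requests.post(NIM_URL, json=payload, timeout=60)
    used_mib, gpu_pct = gpu_snapshot()
    print(f"{n_words:>5} words -> memory used: {used_mib:.0f} MiB, GPU util: {gpu_pct}%")

pynvml.nvmlShutdown()
```

With a sketch like this you would typically see the used-memory figure dominated by the loaded model regardless of passage size, while the utilization percentage rises for the larger inputs.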