Hello,
I’m struggling with an issue where both nvidia-smi and nvtop show GPU memory usage at almost 100% (15.5 GB out of 16 GB) for the single GPU on the system, yet the only process in the list is reported as using 0.
On top of this, from inside the process, if I do this:
import torch

t = torch.cuda.get_device_properties(0).total_memory
r = torch.cuda.memory_reserved(0)
a = torch.cuda.memory_allocated(0)
# Logger is our application logger; r - a is the free space left inside PyTorch's cache
Logger.info(f"Total memory: {t}, Reserved memory: {r}, Allocated memory: {a}, Available memory: {r-a}")
It shows a total of around 15 GB, around 500 MB reserved, and around 480 MB allocated.
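For comparison, I can also query the numbers the driver itself reports (which should match what nvidia-smi sees) from inside the same process. A rough sketch of what I mean, assuming my PyTorch version has torch.cuda.mem_get_info:

import torch

# Driver-level view (should match nvidia-smi): free and total bytes on the device
free_b, total_b = torch.cuda.mem_get_info(0)
# Allocator-level view: what this process has reserved/allocated through PyTorch
reserved_b = torch.cuda.memory_reserved(0)
allocated_b = torch.cuda.memory_allocated(0)
print(
    f"Driver: {(total_b - free_b) / 2**30:.2f} GiB used of {total_b / 2**30:.2f} GiB, "
    f"PyTorch: {reserved_b / 2**30:.2f} GiB reserved, {allocated_b / 2**30:.2f} GiB allocated"
)

If the driver-level number is ~15 GiB used while PyTorch only accounts for ~0.5 GiB, the remainder would have to be held outside PyTorch's caching allocator (the CUDA context, or something else in the process or on the node).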
The process always has something to do; however, GPU compute % sits at 0, except for brief moments when it jumps to around 20% for a split second.
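To double-check the 0% reading, I’m also thinking of sampling utilization from inside the process itself. A rough sketch of what I have in mind, assuming the nvidia-ml-py (pynvml) package is available in the container:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Sample GPU and memory utilization once per second for 10 seconds
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}%, mem util: {util.memory}%, used: {mem.used / 2**30:.2f} GiB")
    time.sleep(1)
pynvml.nvmlShutdown()

If this also reports ~0% while the process is busy, I’d take it as the workload being CPU-bound rather than the metric being misreported.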
The service is running on GKE, in a NodePool with a single GPU where a single Pod is running, so it’s effectively the only Pod using the GPU. The GPU is a Tesla T4, with NVIDIA-SMI 550.90.07, Driver Version 550.90.07, CUDA Version 12.4.
I don’t understand whether this comes from some limitation in the way GPUs are used on GKE or whether something is wrong on my side.
What do you think? Thanks for the help!