I have a program that launches several thousand thread blocks of variable size (configured via launch parameters) on our P100. My pen-and-paper calculations say it should be using only ~200 MB of memory, but when I run nvidia-smi it shows ~600 MB in use.

What initially alerted me to the discrepancy was one of our performance monitoring tools reporting periods of 100% device memory utilization. The amount of memory I'm allocating shouldn't come anywhere near the 12 GB available, and it doesn't match the numbers nvidia-smi reports either, so this was very alarming!

The last strange part of the whole problem is that the tool only reports maxed-out memory usage (I have it email me whenever this happens) under one specific thread block configuration; other configurations don't appear to trigger it. I'm quite certain I'm the only one using the device when this happens.
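For reference, this is roughly how I was planning to sanity-check the numbers from inside the process itself, just a minimal sketch using the standard CUDA runtime call cudaMemGetInfo (note it reports whole-device free/total, so the "used" figure would also include the CUDA context overhead that my pen-and-paper math doesn't account for):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;

    // Query free and total device memory as seen by this process.
    // Calling this also implicitly initializes a CUDA context, so the
    // "used" number below includes context/driver overhead, not just
    // my own cudaMalloc allocations.
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("used:  %.1f MB\n",
           (total_bytes - free_bytes) / (1024.0 * 1024.0));
    printf("free:  %.1f MB\n", free_bytes / (1024.0 * 1024.0));
    printf("total: %.1f MB\n", total_bytes / (1024.0 * 1024.0));
    return 0;
}
```

In the real program I'd call this right after my allocations rather than in a standalone main, but even this standalone version shows a few hundred MB "used" on an otherwise idle device, which makes me suspect my 200 MB estimate and nvidia-smi are counting different things.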
Any ideas? I’m kind of at a loss as to what’s going on here.