The tools I’m aware of that address those topics are the profilers, and they are not readily adaptable to cluster-scale monitoring. Perhaps Scott will have some other suggestions. At a higher level, some of these tools may be of interest, e.g. ganglia:
From my perspective, asking about warp behavior is something like asking whether the AVX-512 intrinsics I am using are actually utilizing every AVX lane.
That seems (to me) like rather more detail than is necessary to answer these questions:
“How busy are these (GPU) servers? Do we need to get more capacity?”
From my perspective, the first level of monitoring is simply process monitoring:
- Is a process using the GPU or not? Is the GPU currently claimed by a process?
The next level of monitoring would be GPU utilization within the process:
- Is the process allocating memory on the GPU? What percentage of the total?
- When the process is using a GPU, how often are CUDA kernels being run during that time?
All of these levels of monitoring are supported by nvidia-smi.
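As a rough illustration of the second level, here is a minimal Python sketch that polls nvidia-smi and parses its CSV output. The `--query-gpu` fields and the `--format=csv,noheader,nounits` option are existing nvidia-smi features; the helper names (`parse_gpu_csv`, `query_gpus`) and the dictionary keys are my own, made up for this example. It also assumes every field comes back numeric, which may not hold on every board or driver.

```python
import csv
import io
import subprocess

# Real nvidia-smi --query-gpu properties. utilization.gpu is the percentage of
# the last sample period during which at least one kernel was executing.
GPU_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse '--format=csv,noheader,nounits' output into one dict per GPU.

    Assumption: all four fields are numeric. Some boards report '[N/A]' for
    certain fields, which this sketch does not handle.
    """
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in rec)
        rows.append({
            "index": int(index),
            "util_pct": int(util),          # how often kernels were running
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
            "mem_pct": 100.0 * int(used) / int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once on the local node and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(GPU_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` on a GPU node should return one dictionary per GPU. For the first level (is a GPU claimed by a process?), `nvidia-smi --query-compute-apps=pid,used_memory --format=csv` lists the compute processes currently holding a GPU and could be parsed the same way.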
From my perspective, there are two different kinds of monitoring:
1. How much activity is there on the GPUs?
2. What is the quality (nature) of the activity on the GPUs?
To meet capacity demand on a near-term basis, only item 1 is important (I think). If someone is using a GPU then, for most use cases I am aware of, no one else can or should be using that GPU, so it doesn’t matter much what sort of activity is going on.
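For the capacity question, the arithmetic can be as simple as the fraction of GPU slots that were claimed across a series of polls. The sketch below assumes you have already recorded, for each poll, one boolean per GPU saying whether any process held it; the function name and the sample layout are hypothetical, not something nvidia-smi produces directly.

```python
def claimed_fraction(samples):
    """Return the fraction of (poll, GPU) slots in which the GPU was claimed.

    samples: list of polls, each poll a list of booleans, one per GPU
             (True = some process held that GPU at poll time).
    """
    total = sum(len(poll) for poll in samples)
    if total == 0:
        return 0.0  # no data collected yet
    claimed = sum(1 for poll in samples for is_claimed in poll if is_claimed)
    return claimed / total
```

A value that stays near 1.0 over days or weeks would suggest the servers are effectively full, regardless of how efficiently each job uses its GPU.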
Item 2 comes into play when datacenter management wants to encourage users to make more effective use of the GPU cycles they are already consuming. It does not fundamentally address the capacity question, except on a long-term basis as users are encouraged to run more efficient codes.