I’m trying to identify bottlenecks in GPU execution performance for deep learning models on Titan V / V100.
I understand that certain requirements must be met for the underlying kernel execution to be performed on Tensor Cores, as described in https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/
nvprof provides an easy way to dump and aggregate API execution stats on the GPU, but the list of API calls made does not seem to indicate whether the execution happened on Tensor Cores or not.
Is there a way to get this info through nvprof or some other way?
If the profiler currently does not (or cannot) split out Tensor Core utilization, it might be a good idea to file an enhancement request with NVIDIA to add that feature. Enhancement requests can be filed using the bug reporting form reached via the registered developer website; simply prefix the synopsis with “RFE:” to mark it as an enhancement request rather than a functional bug.
Thanks for the tip, @njuffa.
Is there any news on the subject? Is there any way to tell whether the load on the GPU is handled by CUDA cores or Tensor Cores (and in what proportion)?
Scanning cuobjdump / nvdisasm output for HMMA instructions may give you an idea what fraction of the code makes use of the Tensor Cores. Combine this with a profiling run that samples GPU time per line of code to get an idea of how much time is spent in that code.
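As a rough sketch of that first step: in real use you would disassemble your own binary (here `my_app` is just a placeholder name) and count HMMA mnemonics. Since that requires a CUDA binary, the example below runs the same filter over a canned, Volta-style SASS snippet standing in for the cuobjdump output; the instruction lines are illustrative, not taken from a real disassembly.

```shell
# Real usage would look something like (my_app is a placeholder binary):
#   cuobjdump -sass my_app | grep -c "HMMA"
#
# Below, a canned SASS-like snippet stands in for the cuobjdump output
# so the filtering step itself can be demonstrated standalone.
grep -c "HMMA" <<'EOF'
        HMMA.884.F32.F32.STEP0 R8, R26.ROW, R16.COL, R8 ;
        FFMA R10, R22, R30, R10 ;
        HMMA.884.F32.F32.STEP1 R12, R26.ROW, R18.COL, R12 ;
EOF
# grep -c prints the number of matching lines (2 for this snippet):
# lines containing HMMA are Tensor Core matrix-multiply-accumulate
# instructions, while FFMA is an ordinary CUDA-core fused multiply-add.
```

Comparing that count against the total instruction count (e.g. `wc -l` on the same SASS dump) gives a crude static estimate of how much of the code targets the Tensor Cores.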