I loved your article here: NVIDIA Deep Learning Performance Documentation
Could you share the code you used for profiling the throughput of cuBLAS and cuDNN operations at different input dimensions?
If not, what are some caveats in implementing such a benchmark? Any advice or guidance would be appreciated!