Could you share the code used to profile cuBLAS and cuDNN throughput at different input dimensions?

I loved your article here: NVIDIA Deep Learning Performance Documentation

Could you share the code you used to profile the throughput of cuBLAS and cuDNN operations at different input dimensions?

If not, what are some caveats to watch for when implementing such a profiler? Any advice or guidance would be appreciated!