Tensor Ops Made Easier in cuDNN

Originally published at: Tensor Ops Made Easier in cuDNN | NVIDIA Technical Blog

Neural network models have quickly taken advantage of NVIDIA Tensor Cores for deep learning since their introduction in the Tesla V100 GPU last year. For example, new performance records for ResNet50 training were announced recently with Tensor Core-based solutions. (See the NVIDIA developer post on new performance milestones for additional details). NVIDIA’s cuDNN library enables CUDA programmers…

Is the constraint that the input and output channel counts be divisible by 8 still valid for every configuration other than packed NCHW? That is, for NHWC, will it still revert to CUDA cores when either the input or output channel count is not divisible by 8?

If so, I assume this is because the cost of padding the channel dimension, when it is the innermost dimension, is large enough to offset the benefit of using Tensor Cores. If C is the innermost dimension, you are essentially copying the entire matrix with a large number of uncoalesced global memory reads and writes. If my guess is wrong, please fill me in on the details. Thanks!
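To make the guess above concrete, here is a minimal host-side sketch in plain Python of what padding C up to a multiple of 8 would involve for a flat NHWC buffer. It is only an illustration of the data movement, not how cuDNN handles this internally; the helper names `round_up` and `pad_channels_nhwc` are hypothetical and not part of any library.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def pad_channels_nhwc(flat, n, h, w, c, multiple=8):
    """Copy a flat NHWC buffer into a new buffer whose channel count is
    rounded up to `multiple`, zero-filling the extra channels.

    Hypothetical helper for illustration only (not a cuDNN call).
    """
    if len(flat) != n * h * w * c:
        raise ValueError("buffer size does not match N*H*W*C")
    c_pad = round_up(c, multiple)
    out = [0.0] * (n * h * w * c_pad)
    # C is innermost, so each pixel owns a contiguous run of c values.
    # Changing c to c_pad changes the stride between pixels, so every
    # pixel's run has to be copied to a new offset.
    for pixel in range(n * h * w):
        src = pixel * c
        dst = pixel * c_pad
        out[dst:dst + c] = flat[src:src + c]
    return out, c_pad


n, h, w, c = 1, 2, 2, 3
x = [float(i) for i in range(n * h * w * c)]
padded, c_pad = pad_channels_nhwc(x, n, h, w, c)
print(c_pad, len(padded))  # prints "8 32"
```

In this sketch every pixel after the first lands at a different offset in the output, which is why the padding amounts to a full copy of the data rather than appending a block at the end.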