- I’ve read that the constraint on the input channel for packed NCHW was lifted in cuDNN 7.3. Is the constraint that input and output channel counts be divisible by 8 still in effect for every other configuration besides packed NCHW? So for NHWC, will it still fall back to CUDA cores when either the input or output channel count is not divisible by 8?
If so, then I assume this is because the cost of padding the channel dimension, when the channel is the innermost dimension, is so large that it offsets the benefit of using Tensor Cores. If C is innermost, you are essentially copying the entire tensor with tons of uncoalesced global memory reads and writes. If my guess is wrong, please fill me in with the details. Thanks
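To make the copy cost I'm guessing at concrete, here is a rough NumPy sketch of what padding the innermost channel dimension of an NHWC tensor to a multiple of 8 would entail (`pad_channels_nhwc` is just a name I made up, not anything from cuDNN):

```python
import numpy as np

def pad_channels_nhwc(x, multiple=8):
    """Pad the innermost (channel) dim of an NHWC tensor up to a multiple.

    Because C is the fastest-varying dimension, this cannot be done in
    place: every element of the original tensor must be copied into a
    new, larger buffer, i.e. the whole tensor is rewritten.
    """
    n, h, w, c = x.shape
    c_padded = ((c + multiple - 1) // multiple) * multiple
    if c_padded == c:
        return x  # already aligned, no copy needed
    out = np.zeros((n, h, w, c_padded), dtype=x.dtype)
    out[..., :c] = x  # full copy of the original data
    return out

x = np.ones((2, 4, 4, 5), dtype=np.float16)  # C=5, not divisible by 8
y = pad_channels_nhwc(x)
print(y.shape)  # (2, 4, 4, 8)
```

If something like this is what a padded-NHWC path would have to do internally, I can see why the library would just revert to CUDA cores instead.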
- Does cuDNN/cuBLAS overlap global memory reads/writes with computation? Or do the kernels wait until all global reads are complete → perform the operations → complete the global writes, in 3 serial stages with no overlap? If there is overlap, how much?