About some details of cuDNN/cuBLAS

  1. I’ve read that the constraint on the input channel count for packed NCHW was lifted as of cuDNN 7.3. Is the constraint that the input and output channel counts be divisible by 8 still valid for every configuration other than packed NCHW? So for NHWC, will it still fall back to CUDA cores when either the input or output channel count is not divisible by 8?

If so, then I assume this is because the cost of padding the channel dimension, when the channel is the innermost dimension, is large enough to offset the benefit of using Tensor Cores. If C is the innermost dimension, then you are essentially copying the entire matrix with tons of uncoalesced global memory reads and writes. If my guess is wrong, please fill me in on the details. Thanks.

  2. Does cuDNN/cuBLAS overlap global memory reads/writes with computation? Or do the kernels run in 3 serial stages with no overlap (wait until all global reads are complete -> perform operations -> complete global writes)? If there is overlap, how much is there?
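
To make the padding scenario in question 1 concrete, here is a minimal plain-Python sketch of where the zeros would land if the channel count were rounded up to a multiple of 8 in each layout. This is only an illustration of the memory layouts, not cuDNN code and not a claim about what cuDNN does internally; the function names (`round_up`, `pad_channels_nchw`, `pad_channels_nhwc`) are hypothetical, and the buffers are assumed to be flat, fully packed lists.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def pad_channels_nchw(data, n, c, h, w, multiple=8):
    """Zero-pad C in a flat, packed NCHW buffer. Returns (padded, c_padded)."""
    c_pad = round_up(c, multiple)
    plane = h * w
    out = []
    for img in range(n):
        start = img * c * plane
        # The existing channel planes of an image stay contiguous;
        # whole zero planes are appended after them.
        out.extend(data[start:start + c * plane])
        out.extend([0] * ((c_pad - c) * plane))
    return out, c_pad


def pad_channels_nhwc(data, n, h, w, c, multiple=8):
    """Zero-pad C in a flat, packed NHWC buffer. Returns (padded, c_padded)."""
    c_pad = round_up(c, multiple)
    out = []
    for pixel in range(n * h * w):
        start = pixel * c
        # C is innermost, so every pixel's short channel vector is
        # followed by its own run of zeros.
        out.extend(data[start:start + c])
        out.extend([0] * (c_pad - c))
    return out, c_pad


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6]  # 1 image, 3 channels, 1x2 pixels
    print(pad_channels_nchw(values, n=1, c=3, h=1, w=2))
    print(pad_channels_nhwc(values, n=1, h=1, w=2, c=3))
```

In the NCHW case the original data remains one contiguous run per image with the zeros appended afterwards, while in the NHWC case the zeros are interleaved after every pixel, which is the pattern the guess above is about.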


  1. Based on these links: https://docs.nvidia.com/deeplearning/sdk/cudnn-archived/cudnn_765/cudnn-developer-guide/index.html#tensor-ops-conv-functions-data-filter-formats and https://docs.nvidia.com/deeplearning/sdk/cudnn-archived/cudnn_765/cudnn-developer-guide/index.html#tensor-ops-tensor-transformations-padding, it seems like the automatic padding for channel counts that aren’t a multiple of 8 only applies to packed NCHW data.

  2. I’m not sure I understand the question correctly, but this sounds essentially like a general concurrency question. If so, then in general, reads can be done concurrently pretty freely, but once writes are involved you need to consider locking to avoid data races, so that your results are correct when sharing data among multiple threads. You could probably do the computations in parallel with the reads, depending on the problem, but you’d probably need to synchronize your threads and lock accordingly before writing the results of any computations if dependent computations or reads follow.
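
The general pattern described in point 2 (lock-free concurrent reads, locked writes, then a synchronization point) can be sketched with host-side CPU threads in Python. This is a minimal illustration under that assumption only; it says nothing about how cuDNN/cuBLAS kernels schedule their global memory traffic, and `parallel_sum_of_squares` is a hypothetical name made up for this example.

```python
import threading


def parallel_sum_of_squares(values, num_threads=4):
    """Sum v*v over `values` using several threads and one shared total."""
    total = [0]              # shared state that every thread writes to
    lock = threading.Lock()  # guards writes to `total`

    def worker(tid):
        # Reads of the shared input need no lock: nobody modifies it.
        partial = sum(v * v for v in values[tid::num_threads])
        # The write to shared state is serialized to avoid a data race.
        with lock:
            total[0] += partial

    threads = [
        threading.Thread(target=worker, args=(tid,))
        for tid in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # synchronize before anything depends on the result
    return total[0]


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

Each thread accumulates into a private `partial` first, so the lock is taken once per thread rather than once per element.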