If I have linear device memory arranged as a T[H][W] array (for some type T), when is it necessary or advantageous to add row padding so that it is actually a T[H][W+pad] arrangement?
One case I know of is cuMemcpy2D. It requires that the pitch, i.e. sizeof(T) * (W + pad), be a multiple of 512. This is true on my Kepler GPU and could be different on other architectures. cuMemcpy2DUnaligned lacks this restriction, but may run slower. cuMemsetD2D32, etc., will work but may also run slower. So if I want to use any of these functions, I would want to allocate the memory with cuMemAllocPitch.
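To make the allocation/copy pairing concrete, here is a minimal sketch of the pattern described above. It assumes a CUDA context is already current; the function name `copy_rows` and the abbreviated error handling are my own, not from any particular codebase.

```c
#include <cuda.h>
#include <stdio.h>

/* Sketch: allocate a pitched H x W array of floats on the device and
   copy an unpadded host buffer into it with cuMemcpy2D. */
int copy_rows(const float *host, int H, int W)
{
    CUdeviceptr dptr;
    size_t pitch; /* bytes per padded row, chosen by the driver */

    /* ElementSizeBytes = 4 tells the driver to pick a pitch suitable
       for coalesced 4-byte accesses and for the 2D copy functions. */
    if (cuMemAllocPitch(&dptr, &pitch, (size_t)W * sizeof(float),
                        (size_t)H, 4) != CUDA_SUCCESS)
        return -1;

    CUDA_MEMCPY2D cpy = {0};
    cpy.srcMemoryType = CU_MEMORYTYPE_HOST;
    cpy.srcHost       = host;
    cpy.srcPitch      = (size_t)W * sizeof(float); /* host rows unpadded */
    cpy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    cpy.dstDevice     = dptr;
    cpy.dstPitch      = pitch;                     /* device rows padded */
    cpy.WidthInBytes  = (size_t)W * sizeof(float);
    cpy.Height        = (size_t)H;

    if (cuMemcpy2D(&cpy) != CUDA_SUCCESS) {
        cuMemFree(dptr);
        return -1;
    }
    printf("driver-chosen pitch = %zu bytes\n", pitch);
    cuMemFree(dptr);
    return 0;
}
```

Because the pitch comes from cuMemAllocPitch rather than being computed by hand, it automatically satisfies whatever alignment cuMemcpy2D wants on the current device.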
Another case is cuTexRefSetAddress, where both the memory address and the pitch have to be appropriately aligned.
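For the pitched case specifically, the constraint shows up in the 2D variant, cuTexRefSetAddress2D. A minimal sketch, assuming `texref` was obtained from a loaded module and `dptr`/`pitch` came from cuMemAllocPitch (so both the base address and the pitch already satisfy the device's alignment requirements):

```c
/* Sketch: bind pitched linear device memory to a texture reference.
   texref, dptr, pitch, W, and H are assumed to exist already. */
CUDA_ARRAY_DESCRIPTOR desc = {0};
desc.Format      = CU_AD_FORMAT_FLOAT;
desc.NumChannels = 1;
desc.Width       = W;
desc.Height      = H;

CUresult rc = cuTexRefSetAddress2D(texref, &desc, dptr, pitch);
/* rc will be CUDA_ERROR_INVALID_VALUE if dptr or pitch is misaligned. */
```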
Other than these, are there any other cases where padding is either required or is faster?
If a warp is accessing a row of the memory and the row is not aligned, then the accesses may span extra cache lines, but I don’t expect this to be a significant performance hit. Please correct me if I’m wrong.
If I am using unified memory, I don’t know whether these requirements or performance advantages still apply. Since the driver can access the device memory through host virtual addresses, it might not be an issue, and in fact having no gap between rows might perform better. I’ll experiment with this on my Kepler and update this post with the results. Update: I found that memcpy2D and memset2D don’t require pitch alignment for unified memory.
BTW, can I safely use CU_DEVICE_ATTRIBUTE_TEXTURE_PITCH_ALIGNMENT to determine the alignment of the pitch in all cases?
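Relatedly, here is a sketch of querying that attribute with cuDeviceGetAttribute. Note that the driver API also exposes CU_DEVICE_ATTRIBUTE_TEXTURE_ALIGNMENT for the base address, which is a separate value, so the pitch attribute alone may not cover every case. Assumes device 0 and omits error checking.

```c
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    int pitch_align = 0, addr_align = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&pitch_align,
        CU_DEVICE_ATTRIBUTE_TEXTURE_PITCH_ALIGNMENT, dev);
    cuDeviceGetAttribute(&addr_align,
        CU_DEVICE_ATTRIBUTE_TEXTURE_ALIGNMENT, dev);

    printf("texture pitch alignment: %d bytes\n", pitch_align);
    printf("texture base alignment:  %d bytes\n", addr_align);
    return 0;
}
```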