The reasoning behind 512 is simple.
- For best performance, the threads of a warp have to access memory in a coalesced pattern.
- A thread can read a 16-byte word in a single instruction if the address is 16-byte aligned (e.g. when loading an int4).
Each warp could therefore theoretically access 32 * 16 = 512 bytes in one instruction. The pitch is chosen as a multiple of 512 so that every row of pitched memory starts at an address where it is valid to access the row in this manner.