Purpose of padding and cudaMallocPitch()

Is padding data and using cudaMallocPitch() really necessary for devices of compute capability 2.0?

The article below suggests that, for devices of compute capability 2.0, there is almost no performance degradation when the data is misaligned but accessed sequentially:
http://devblogs.nvidia.com/parallelforall/how-access-global-memory-efficiently-cuda-c-kernels

Are there any concrete benefits to aligning my matrix data on my CC 2.0 device that outweigh the wasted memory?
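To make the trade-off concrete, here is a minimal sketch of the pitched allocation in question. It asks cudaMallocPitch() for a matrix, prints how many bytes per row are padding, and runs a small kernel that indexes rows through the returned pitch. The matrix size, the `scale` kernel, and the block dimensions are made up for illustration; the pitch the runtime actually returns depends on the device, so no particular overhead figure is assumed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element of a pitched matrix by a factor.
// Row r starts at (char *)data + r * pitch, where pitch is in bytes.
__global__ void scale(float *data, size_t pitch, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        float *rowPtr = (float *)((char *)data + row * pitch);
        rowPtr[col] *= factor;
    }
}

int main()
{
    const int width = 1000;   // elements per row (illustrative value)
    const int height = 1000;  // number of rows (illustrative value)
    const size_t rowBytes = width * sizeof(float);

    float *d = nullptr;
    size_t pitch = 0;

    // The runtime chooses the pitch; it is at least rowBytes.
    cudaError_t err = cudaMallocPitch((void **)&d, &pitch, rowBytes, height);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Report the memory cost of the padding for this matrix on this device.
    printf("row bytes: %zu, pitch: %zu, padding per row: %zu (%.2f%% overhead)\n",
           rowBytes, pitch, pitch - rowBytes,
           100.0 * (double)(pitch - rowBytes) / (double)rowBytes);

    // Zero only the used part of each row, then run the kernel.
    cudaMemset2D(d, pitch, 0, rowBytes, height);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale<<<grid, block>>>(d, pitch, width, height, 2.0f);

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Timing the same kernel against an unpadded cudaMalloc() buffer (with pitch set to the row size in bytes) would be one way to check whether the padding buys anything on a given CC 2.0 device.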