Purpose of padding and cudaMallocPitch()

Are padding data and using cudaMallocPitch() really necessary for devices of compute capability 2.0?

In this article,

For devices of compute capability 2.0, it seems that there is almost no performance degradation if the data is misaligned but accessed sequentially. Are there any concrete benefits to aligning my matrix data on my CC 2.0 device that outweigh the wasted memory?