Is cudaMallocPitch really needed for 2D arrays / matrices?

The Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#device-memory-accesses) seems to say that 2D arrays need to be allocated using cudaMallocPitch for optimal performance.

However the CUBLAS documentation doesn’t mention “pitch” at all (unless my Ctrl-F key is broken).

So which is it?

(1) Operating on compactly stored 2D arrays allocated with cudaMalloc() is fully functional, including via CUBLAS.

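(2) Using 2D arrays stored in pitched allocations made with cudaMallocPitch() may improve performance. To use such allocations with CUBLAS, set the lda, ldb, ldc arguments to the pitch expressed in elements (the pitch in bytes returned by cudaMallocPitch() divided by the element size) instead of the compact dimension.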
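To make the relationship between pitch and leading dimension concrete, here is a minimal host-only C++ sketch. It does not call CUDA or CUBLAS. The helper names (padded_pitch_bytes, leading_dimension, cm_index, to_pitched) and the 512-byte alignment used in the usage note are assumptions for illustration only. On a real device the pitch is whatever cudaMallocPitch() reports, and because CUBLAS is column-major, the padded dimension is the column length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round one column's byte length up to a multiple of alignment_bytes.
// Hypothetical stand-in for the pitch that cudaMallocPitch() would report;
// the real value is chosen by the CUDA runtime, not by this formula.
std::size_t padded_pitch_bytes(std::size_t column_bytes, std::size_t alignment_bytes) {
    assert(alignment_bytes > 0);
    return (column_bytes + alignment_bytes - 1) / alignment_bytes * alignment_bytes;
}

// Convert a pitch in bytes into a leading dimension in elements,
// which is the unit the lda/ldb/ldc arguments are expressed in.
std::size_t leading_dimension(std::size_t pitch_bytes, std::size_t element_bytes) {
    assert(element_bytes > 0 && pitch_bytes % element_bytes == 0);
    return pitch_bytes / element_bytes;
}

// Column-major index of element (row, col) in a matrix with leading dimension ld.
std::size_t cm_index(std::size_t row, std::size_t col, std::size_t ld) {
    return row + col * ld;
}

// Copy a compact m x n column-major matrix (leading dimension m) into a
// padded buffer with leading dimension ld >= m. Padding entries stay zero.
std::vector<float> to_pitched(const std::vector<float>& compact,
                              std::size_t m, std::size_t n, std::size_t ld) {
    assert(ld >= m && compact.size() == m * n);
    std::vector<float> pitched(ld * n, 0.0f);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            pitched[cm_index(row, col, ld)] = compact[cm_index(row, col, m)];
        }
    }
    return pitched;
}
```

For example, a column of 100 four-byte floats with an assumed 512-byte alignment gives a pitch of 512 bytes, so the leading dimension passed as lda would be 128 rather than 100. The matrix dimensions m and n passed to the routine stay unchanged; only the stride between columns grows.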

I find compact storage for 2D matrices easier to deal with, and the performance gains from pitched storage are likely minor on modern GPUs: the better data alignment only helps minimize the number of wide loads the GPU performs. I have used pitched 2D matrices when working with textures.

Thanks!