Do I really need cudaMallocPitch() for cuBLAS routines?

Do I really need cudaMallocPitch() for cuBLAS routines,
or is cudaMalloc() already good enough for performance?

My timing program, which computes C(201, 2001) = A(201, 2001) * B(2001, 2001) 100 times,
shows no performance difference between the aligned and unaligned versions.
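
For concreteness, here is a minimal sketch of the kind of comparison described above. It is not the actual timing program. It assumes single precision, column-major storage, and the cuBLAS v2 API, and the helper name allocMatrix is hypothetical. With cudaMallocPitch() the padding shows up in cuBLAS only as a larger leading dimension (pitch in bytes divided by sizeof(float), assuming the pitch is a multiple of the element size). Error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: allocate a rows x cols column-major float matrix.
// With pitched == true each column is padded by cudaMallocPitch().
// *ld receives the leading dimension in elements.
static float *allocMatrix(int rows, int cols, bool pitched, int *ld) {
    void *p = nullptr;
    if (pitched) {
        size_t pitchBytes = 0;
        // "width" is one column in bytes, "height" is the number of columns.
        cudaMallocPitch(&p, &pitchBytes, rows * sizeof(float), cols);
        // Assumption: the returned pitch is a multiple of sizeof(float).
        *ld = static_cast<int>(pitchBytes / sizeof(float));
    } else {
        cudaMalloc(&p, sizeof(float) * rows * cols);
        *ld = rows;  // columns are packed back to back
    }
    // Zero the whole allocation, padding included.
    cudaMemset(p, 0, sizeof(float) * (*ld) * cols);
    return static_cast<float *>(p);
}

int main() {
    // Dimensions from the post: C(m, n) = A(m, k) * B(k, n).
    const int m = 201, k = 2001, n = 2001;
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    for (int pitched = 0; pitched <= 1; ++pitched) {
        int lda, ldb, ldc;
        float *A = allocMatrix(m, k, pitched != 0, &lda);
        float *B = allocMatrix(k, n, pitched != 0, &ldb);
        float *C = allocMatrix(m, n, pitched != 0, &ldc);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < 100; ++i) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, lda, B, ldb, &beta, C, ldc);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%s: lda=%d ldb=%d ldc=%d, %.3f ms for 100 calls\n",
               pitched ? "cudaMallocPitch" : "cudaMalloc", lda, ldb, ldc, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(A);
        cudaFree(B);
        cudaFree(C);
    }

    cublasDestroy(handle);
    return 0;
}
```

This should build with something like `nvcc sketch.cu -lcublas` (the file name is arbitrary).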

So the question is: when should I use cudaMallocPitch()?

Thanks!