Do I really need cudaMallocPitch() for cuBLAS routines,
or is cudaMalloc() already good enough for performance?
My timing program, which computes C(201, 2001) = A(201, 2001) * B(2001, 2001) 100 times,
shows no performance difference between the aligned and the unaligned versions.
So the question is: when should I use cudaMallocPitch()?