The matrix multiplication example uses shared memory to make better use of the memory bus. Without shared memory, you would have to read one of the arrays in row-major order (which is fast) and the other in column-major order (which is slow, since those accesses won't be coalesced).
The matrixMul example instead breaks the large matrix multiplication into a series of smaller sub-matrix multiplications, where both sub-matrices fit in shared memory. Each sub-matrix can be read from global memory in a coalesced way, and once it is in shared memory you can read its elements in pretty much any order with little performance penalty. (Okay, there can be bank conflicts if you are not careful, but those cost far less than uncoalesced global memory reads.)
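A minimal sketch of that tiling pattern, with hypothetical names (`matMulTiled`, `TILE`) and the simplifying assumption that the matrix dimension `N` is a multiple of the tile width:

```cuda
#define TILE 16  // assumed tile width; one block computes one TILE x TILE tile of C

__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March the two tiles across A's row of tiles and B's column of tiles.
    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x touches consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully staged before anyone reads it

        // Shared-memory reads in arbitrary order are cheap here.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before overwriting it
    }
    C[row * N + col] = acc;
}
```

Note that both staging loads index global memory with `threadIdx.x` as the fastest-varying term, which is what makes them coalesced even though the inner product later reads `Bs` down a column.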
For 1D array operations, shared memory gives no benefit, since you can already read and write all the input and output arrays directly in a coalesced order. There you just want as many warps resident as possible, so that arithmetic in one warp can execute while memory reads are pending in other warps. This is accomplished by having each thread compute exactly one element of the output array, which it sounds like you are already doing.
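For comparison, the whole 1D pattern is just this (a hypothetical SAXPY-style kernel; names are illustrative):

```cuda
// Each thread computes exactly one output element, so both the reads
// and the write are coalesced across the warp with no shared memory.
__global__ void axpy(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard for n not a multiple of blockDim.x
        out[i] = a * x[i] + y[i];
}
```

Launch it with enough blocks to cover `n` (e.g. `(n + 255) / 256` blocks of 256 threads); the many resident warps hide the global-memory latency.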