hi,
Would simple operations like multiply, sum, subtract, etc. on two 1D arrays benefit from shared memory?
My thought was that they wouldn't, since the threads don't need to share information and each only reads from and writes to global memory once.
But when I saw the matrixMul example in the SDK I got confused, because it's just reading one value from each array, multiplying, and writing the result back…
How does it benefit?
Any suggestions on improving the performance of these simple operations (examples/sample projects would suffice)? I'm already threading it…
The matrix multiplication example uses shared memory to make better use of the memory bus. Without shared memory, you would have to read one of the matrices in row-major order (which is fast) but the other in column-major order (which is slow, since those reads won't be coalesced).
The matrixMul example instead breaks down the large matrix multiplication task into a series of smaller sub-matrix multiplications, where both sub-matrices fit into shared memory. The sub-matrix read can be done in a coalesced way, and then once in shared memory, you can read the elements in pretty much any order you want with little performance penalty. (Ok, there can be bank conflicts if you are not careful, but those have a much lower cost than uncoalesced global memory reads.)
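To make that concrete, here is a minimal sketch of the tiled approach (not the SDK code itself; the kernel name, the TILE size, and the assumption that N is a multiple of TILE are my own, just for illustration):

#define TILE 16

// C = A * B for square N x N matrices stored row-major.
// Assumes N is a multiple of TILE to keep the sketch short.
__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    // One tile of A and one tile of B sit in shared memory at a time.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk across a row of tiles of A and a column of tiles of B.
    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile; consecutive threads
        // read consecutive addresses, so both loads coalesce.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // The inner product over the tile now reads only shared memory,
        // so the access order no longer matters for coalescing.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}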
For 1D array operations, there is no benefit, since you can read and write all the input and output arrays in a coalesced order directly. There you just want as many warps running as possible so that arithmetic in one warp can be executing while memory reads are waiting in other warps. This is accomplished by having each thread compute exactly one element of the output array, which it sounds like you are already doing.
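For reference, a simple element-wise kernel along those lines might look like this (the names and launch configuration are just illustrative, not taken from any particular sample):

// Element-wise product of two 1D arrays: one thread per output element.
// Adjacent threads touch adjacent addresses, so the two reads and the
// write are all coalesced, and shared memory would add nothing here.
__global__ void multiply1D(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] * b[i];
}

// Launch with enough blocks to cover the array, e.g.:
//   multiply1D<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

With a large enough array this keeps plenty of warps in flight, which is what hides the global memory latency.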