The matrix multiplication example uses shared memory to make better use of the memory bus. Without shared memory, you would have to read one of the arrays in row-major order (which is fast) and the other in column-major order (which is slow, since those accesses won't be coalesced).
The matrixMul example instead breaks the large matrix multiplication into a series of smaller sub-matrix multiplications, where both sub-matrices fit in shared memory. Each sub-matrix can be read from global memory in a coalesced way, and once it is in shared memory you can read its elements in pretty much any order with little performance penalty. (Okay, there can be bank conflicts if you are not careful, but those cost far less than uncoalesced global memory reads.)
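A minimal sketch of that tiling pattern, with hypothetical names (`matMulTiled`, `TILE`) and the simplifying assumption that the matrix dimension `N` is a multiple of the tile width:

```cuda
#define TILE 16  // assumed tile width; one block computes one TILE x TILE tile of C

__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March the two tiles across A's row of tiles and B's column of tiles.
    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x touches consecutive addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully staged before anyone reads it

        // Shared-memory reads in arbitrary order are cheap here.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with this tile before overwriting it
    }
    C[row * N + col] = acc;
}
```

Note that both staging loads index global memory with `threadIdx.x` as the fastest-varying term, which is what makes them coalesced even though the inner product later reads `Bs` down a column.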
For 1D array operations, shared memory gives no benefit, since you can already read and write all the input and output arrays directly in a coalesced order. There you just want as many warps resident as possible, so that arithmetic in one warp can execute while memory reads are pending in other warps. This is accomplished by having each thread compute exactly one element of the output array, which it sounds like you are already doing.
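For comparison, the whole 1D pattern is just this (a hypothetical SAXPY-style kernel; names are illustrative):

```cuda
// Each thread computes exactly one output element, so both the reads
// and the write are coalesced across the warp with no shared memory.
__global__ void axpy(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard for n not a multiple of blockDim.x
        out[i] = a * x[i] + y[i];
}
```

Launch it with enough blocks to cover `n` (e.g. `(n + 255) / 256` blocks of 256 threads); the many resident warps hide the global-memory latency.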