I have a matrix and need to add a constant to every element, say 2.
A kernel for this would basically do the following:
1. Read the element
2. Add 2
3. Write the element
My question is: is it worth loading each submatrix into shared memory to perform the operation, or can I just read, modify, and write each element through registers? I am assuming coalesced global memory accesses in both cases.
My opinion is that by doing it through registers I can avoid using shared memory altogether.
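For reference, the register-only version I have in mind would look roughly like the sketch below. It assumes the matrix is stored contiguously as float and can be treated as a flat array of n elements. The names addConstant and addConstantOnDevice are placeholders of mine, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element: global memory -> register -> global memory.
// No shared memory is declared or used.
__global__ void addConstant(float* data, int n, float value)
{
    // Consecutive threads touch consecutive addresses, so accesses coalesce.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];  // 1. read the element into a register
        x += value;         // 2. add the constant
        data[i] = x;        // 3. write the element back
    }
}

// Hypothetical host helper: copies the matrix to the device, runs the
// kernel, and returns the result. Assumes a flat, contiguous layout.
std::vector<float> addConstantOnDevice(const std::vector<float>& host, float value)
{
    std::vector<float> result(host.size());
    int n = static_cast<int>(host.size());
    if (n == 0) return result;

    float* dev = nullptr;
    size_t bytes = host.size() * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // block size chosen arbitrarily
    int blocks = (n + threads - 1) / threads;  // round up to cover all elements
    addConstant<<<blocks, threads>>>(dev, n, value);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return result;
}
```

Here every element is read once and written once by the same thread, so nothing is shared between threads in a block.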