I don’t see any reason why using shared memory will speed up a program if the value is only retrieved from global memory once. Is there any reason to do this?
Here’s a contrived example: I have a kernel which takes a value, multiplies it by 5, and stores the result in another value. I am presuming this is the fastest way of achieving this:
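The kernel from the original post is not reproduced here. As a minimal sketch of what is described, assuming float data and the hypothetical names `scale_by_five`, `in`, `out` and `n`, it might look something like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical reconstruction, not the poster's actual code.
// Each thread reads one element from global memory exactly once,
// multiplies it by 5, and writes one result. Consecutive threads
// touch consecutive addresses.
__global__ void scale_by_five(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 5.0f * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;

    // Managed memory keeps the host code short; explicit cudaMemcpy
    // would work equally well.
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = (float)i;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_by_five<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Staging `in[i]` through shared memory here would only add an extra copy, since no other thread ever needs that value.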
You are absolutely right. In that example, each value is read from global memory exactly once, the reads should be fully coalesced, and there would be no advantage to using shared memory. However, consider an only slightly different variant of the same idea:
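The variant from the original reply is not reproduced here. As one hypothetical illustration of the point (an assumption, not necessarily the example the reply had in mind), suppose each output is 5 times the sum of an element and its two neighbours. Now every input value is needed by up to three threads, so a block can load its tile into shared memory once and reuse it. The names `neighbour_sum_times_five`, `BLOCK` and `tile` are made up for this sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // must match the block size used at launch

// Hypothetical variant: out[i] = 5 * (in[i-1] + in[i] + in[i+1]),
// with out-of-range neighbours treated as 0.
__global__ void neighbour_sum_times_five(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];  // one halo cell on each side

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;           // position inside the tile

    // Each thread loads its own element once.
    tile[t] = (i < n) ? in[i] : 0.0f;

    // The first and last threads of the block also load the halo cells.
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i <= n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

    // Every thread in the block reaches this barrier.
    __syncthreads();

    // Neighbouring values now come from shared memory.
    if (i < n)
        out[i] = 5.0f * (tile[t - 1] + tile[t] + tile[t + 1]);
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;

    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = 1.0f;

    const int blocks = (n + BLOCK - 1) / BLOCK;
    neighbour_sum_times_five<<<blocks, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f, out[1] = %f\n", out[0], out[1]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

How much this helps in practice depends on the hardware and on how well its caches already serve the repeated reads, so it is worth timing both versions rather than assuming a speedup.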