Any data that you expect to need frequently should be staged in shared memory explicitly. After the computation, store the results back to global memory, then fetch the next set from global memory into shared memory and repeat.
In my code I am trying to do this (store global memory values in shared memory and then reuse them from shared memory), but I cannot get a performance improvement, because a card with compute capability less than 1.2 cannot coalesce 8-bit memory accesses. My code uses a char array, and I am trying to coalesce the accesses to it.
Can you give me any hints for handling this char array (inside a device function) so that I get a performance improvement?
I think I did. I asked you to access them through “int *” pointers, but only while fetching from global memory.
Once they are in shared memory, use them as “characters”.
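To make that concrete, here is a minimal sketch of the idea, not your actual kernel. The names `incrementChars`, `runDemo`, and `THREADS`, and the "add 1 to every byte" computation, are made up for illustration. It assumes the char array is 4-byte aligned (memory returned by `cudaMalloc` is) and that its length is a multiple of 4. Each thread fetches one 32-bit word from global memory, so neighbouring threads read neighbouring words, and the individual bytes are only touched once they are in shared memory.

```cuda
#include <cuda_runtime.h>

#define THREADS 64  // threads per block (an assumed value for this sketch)

// Hypothetical kernel: adds 1 to every byte of `data`.
// numWords is the array length in bytes divided by 4.
__global__ void incrementChars(char *data, int numWords)
{
    // Declared as int so the shared buffer is 4-byte aligned.
    __shared__ int tileWords[THREADS];
    char *tile = reinterpret_cast<char *>(tileWords);

    int *data32 = reinterpret_cast<int *>(data);
    int wordIdx = blockIdx.x * THREADS + threadIdx.x;
    bool active = wordIdx < numWords;

    // 32-bit load from global memory: thread i reads word i.
    if (active)
        tileWords[threadIdx.x] = data32[wordIdx];
    // Needed in general, when a thread reads bytes loaded by another thread.
    __syncthreads();

    // Byte-wise work happens in shared memory only.
    // Here each thread handles the 4 bytes of the word it loaded.
    if (active)
        for (int k = 0; k < 4; ++k)
            tile[4 * threadIdx.x + k] += 1;
    __syncthreads();

    // 32-bit store back to global memory.
    if (active)
        data32[wordIdx] = tileWords[threadIdx.x];
}

// Hypothetical host-side check: returns true if every byte was incremented.
bool runDemo()
{
    const int n = 4 * THREADS * 4;  // 4 blocks, 4 bytes per thread
    char host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (char)(i % 100);

    char *dev = NULL;
    if (cudaMalloc((void **)&dev, n) != cudaSuccess)
        return false;
    cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);

    int numWords = n / 4;
    int blocks = (numWords + THREADS - 1) / THREADS;
    incrementChars<<<blocks, THREADS>>>(dev, numWords);

    cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < n; ++i)
        if (host[i] != (char)(i % 100 + 1))
            return false;
    return true;
}
```

If your char array does not start on a 4-byte boundary, or its length is not a multiple of 4, you would need to handle the leading and trailing bytes separately; the sketch leaves that out.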