CUDA memory optimization for Jetson TX1

I have a kernel launched with a grid of 81 blocks and 256 threads per block.
The first 201 threads in each block read the same part of global memory.

Example: thread 0 in block 0 reads the global memory array at index 0
thread 1 in block 0 reads the global memory array at index 1

thread 0 in block 1 reads the global memory array at index 0
thread 1 in block 1 reads the global memory array at index 1
…and so on (a sketch of the pattern is below)
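
Roughly, each block does something like this (the kernel and array names are just placeholders to illustrate the pattern, not my real code):

    __global__ void exampleKernel(const float *data, float *out)
    {
        // The first 201 threads of every block read data[0..200],
        // so the same global addresses are fetched once per block.
        if (threadIdx.x < 201) {
            out[blockIdx.x * 201 + threadIdx.x] = data[threadIdx.x];
        }
    }

launched roughly as exampleKernel<<<81, 256>>>(d_data, d_out);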

So I have a lot of reads of the same global memory addresses across blocks.
Is there a better option for this? Shared memory is not a solution here, I think, since the reuse is across blocks rather than within a block.
Texture memory, perhaps?

The L2 cache will already cover most of the redundant reads.

The L1 cache and the texture cache are both local to each SM and fall back to L2 on a miss, as far as I know. So each SM will still have to fetch the same data redundantly.

You could declare the global memory pointer as const __restrict__ in the kernel signature. That lets the compiler issue the reads through the read-only (texture) data cache (alternatively, use the __ldg() intrinsic).
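
Something along these lines (a minimal sketch; kernel and parameter names are placeholders):

    __global__ void exampleKernelRO(const float * __restrict__ in,
                                    float * __restrict__ out)
    {
        if (threadIdx.x < 201) {
            // const + __restrict__ lets the compiler route this load
            // through the read-only (texture) cache; __ldg() requests
            // the same thing explicitly.
            float v = __ldg(&in[threadIdx.x]);
            out[blockIdx.x * 201 + threadIdx.x] = v;
        }
    }

__ldg() needs compute capability 3.5 or higher, which the TX1 (Maxwell) has.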

I doubt it would provide a big speedup.

Christian