Constant memory is almost always fastest, especially when all threads in a warp read the same address (the constant cache broadcasts such reads), although some people have measured a speedup by copying the data into shared memory at the start of the kernel instead.
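A minimal sketch of the constant-memory approach. The table name, size, and kernel are illustrative assumptions, not from the original text:

```cuda
#include <cstdio>

#define N_COEF 256
// Illustrative coefficient table living in constant memory.
__constant__ float d_coef[N_COEF];

__global__ void apply_coef(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // When every thread in a warp reads the same d_coef entry,
        // the constant cache serves the whole warp with one broadcast.
        out[i] = in[i] * d_coef[i % N_COEF];
}

void upload_coefficients(const float *h_coef)
{
    // Constant memory is written from the host via cudaMemcpyToSymbol;
    // it cannot be written from inside a kernel.
    cudaMemcpyToSymbol(d_coef, h_coef, N_COEF * sizeof(float));
}
```

The broadcast is the key property: divergent reads across a warp are serialized, so constant memory pays off most when access is uniform.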
This assumes that everything fits in constant memory, of course; it is a limited resource (64 KB on current devices). If not, try to stage the part of the data each block needs in shared memory, reading it from global memory with coalesced reads at the beginning of the kernel.
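The staging pattern can be sketched as follows. The tile size and the irregular access in the second half are illustrative assumptions:

```cuda
__global__ void stage_then_use(const float *g_data, float *out, int n)
{
    // One tile per block; assumes blockDim.x == 256.
    __shared__ float s_data[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Coalesced load: consecutive threads read consecutive addresses,
    // so the warp's reads combine into a few wide transactions.
    if (i < n)
        s_data[tid] = g_data[i];
    __syncthreads();  // tile must be complete before anyone uses it

    // From here on, irregular accesses hit fast shared memory
    // instead of uncoalesced global memory (stride 7 as an example).
    if (i < n)
        out[i] = s_data[(tid * 7) % blockDim.x];
}
```

The point is to pay the global-memory cost once, in the most efficient access pattern, and do all subsequent scattered reads against shared memory.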
If that isn’t possible either, for example because the access pattern is unpredictable at compile time, use textures. Raw texture bandwidth is somewhat lower than global-memory bandwidth, but texture reads go through a cache optimized for 2D spatial locality, which makes textures ideal for some access patterns. You also get bilinear filtering and edge clamping/wrapping for free.
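A sketch using the texture-object API (the modern replacement for texture references); the setup boilerplate for the underlying `cudaArray` is omitted, and the kernel is an illustrative assumption:

```cuda
// Host side: the descriptor fields that enable the "free" features.
cudaTextureDesc make_texture_desc()
{
    cudaTextureDesc td = {};
    td.addressMode[0]   = cudaAddressModeWrap;   // wrap at the edges
    td.addressMode[1]   = cudaAddressModeWrap;
    td.filterMode       = cudaFilterModeLinear;  // bilinear filtering
    td.readMode         = cudaReadModeElementType;
    td.normalizedCoords = 1;                     // coordinates in [0,1)
    return td;
}

__global__ void sample(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // tex2D interpolates between neighboring texels (bilinear
        // filtering) and wraps out-of-range coordinates, both handled
        // by the texture hardware at no extra cost to the kernel.
        out[y * w + x] =
            tex2D<float>(tex, (x + 0.5f) / w, (y + 0.5f) / h);
}
```

The 0.5 offsets center each sample on a texel, which matters when linear filtering is enabled.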