Basic Texture Question

I’m trying to find a good source describing how textures can improve performance compared to plain global memory reads. (Note: this is on a compute capability 1.3 device; I believe some of the higher compute capability devices may be able to broadcast from global memory?) I keep seeing statements along the lines of “textures improve performance when there is a spatial pattern to the memory reads”, without much detail being given.

For instance, given threads 1 through n, suppose each thread i reads pos[i], pos[i+1], and pos[i+2] in sequence from the texture. It seems clear (I think) that the texture could cache pos[2] through pos[n] from the first read, so the second read (pos[i+1]) would require only pos[n+1] to be fetched from global memory. Likewise, the third read (pos[i+2]) would require only pos[n+2]. Is this understanding correct (assuming n is small enough that everything can be cached)?
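To make the counting in that example concrete, here is a rough host-side C++ sketch. It is only an idealized model, not a description of how the texture hardware actually behaves: it assumes an unbounded cache that holds individual elements, whereas a real texture cache is finite and fetches whole cache lines. The function name `newFetchesPerStep` is made up for this illustration.

```cpp
#include <set>
#include <vector>

// Idealized model (assumption): an unbounded cache that stores single elements.
// Threads i = 1..n each read pos[i + s] at step s (s = 0, 1, 2, ...).
// Returns, for each step, how many elements were not already in the cache.
std::vector<int> newFetchesPerStep(int n, int steps) {
    std::set<int> cached;      // indices of pos[] seen so far
    std::vector<int> misses;   // misses[s] = new elements needed at step s
    for (int s = 0; s < steps; ++s) {
        int missCount = 0;
        for (int i = 1; i <= n; ++i) {
            // insert() reports whether the index was new to the set
            if (cached.insert(i + s).second) {
                ++missCount;
            }
        }
        misses.push_back(missCount);
    }
    return misses;
}
```

Under this model, with n = 8 and three reads per thread, the first read brings in 8 elements and each later read brings in just one new element (pos[n+1], then pos[n+2]).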

Now, what if the threads have an access pattern as follows (the number of threads is small to make this simple):
Thread 1: pos[1] pos[3] pos[4]
Thread 2: pos[1] pos[2] pos[3]
Thread 3: pos[2] pos[3] pos[4]
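For the three-thread pattern above, the amount of reuse can be counted the same way. This is again only a host-side C++ sketch under the assumption of an ideal cache with no capacity limit; `ReadStats` and `countReads` are hypothetical names used for illustration, and the result says nothing about whether the hardware actually broadcasts.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Total reads issued versus distinct addresses touched.
struct ReadStats {
    std::size_t totalReads;
    std::size_t distinctAddresses;
};

// perThread[t] lists the pos[] indices that thread t reads.
ReadStats countReads(const std::vector<std::vector<int>>& perThread) {
    std::set<int> distinct;
    std::size_t total = 0;
    for (const auto& reads : perThread) {
        for (int index : reads) {
            distinct.insert(index);  // duplicates are ignored by the set
            ++total;
        }
    }
    return ReadStats{total, distinct.size()};
}
```

Calling `countReads({{1, 3, 4}, {1, 2, 3}, {2, 3, 4}})` gives 9 reads over only 4 distinct addresses, so in the best case more than half of the reads could be served without going back to global memory.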

In other words, I’m wondering whether textures have some sort of broadcast ability when multiple threads access the same address, or whether they handle “somewhat-ordered” reads well. The reason I ask is that I have a triangular mesh and need to read a bunch of float4s corresponding to the node positions that define the faces. The mesh is too large to load all of the vertices into shared memory, so my options are:

  1. Load a subset into shared memory and keep the rest in global memory, reading from global memory only on a miss.
  2. Load the entire mesh into global memory and read from there.
  3. Load the entire mesh into global memory, bind it to a texture, and read from there.
  4. ???