PIC code: shared memory vs texture vs L1 cache (Fermi)

Hello everybody, we are trying to implement a particle-in-cell (PIC) code on the GPU.
The code involves particles that can move freely and fields (electric and magnetic) that are stored on a 3D grid.

Our initial idea is that 1 thread = 1 particle.
To calculate the new particle momenta we need the particle's position and 4x4x4 (4 in each dimension) field values from the grid. But we have several particles in the same cell (between 20 and 1000), so they all need the same field values. We thought of loading the field values into shared memory, but if a lot of threads access the same shared memory there will be bank conflicts, I guess. So is it better to load the fields through textures and make use of the texture cache?
What about Fermi, does the L1 cache help in that case? Field reads are probably not coalesced, because each particle needs neighbours in every dimension and those are not contiguous in a 3D array. Any ideas are appreciated. Thanks
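
To make the shared memory idea concrete, here is a minimal sketch, not a tested PIC kernel. It assumes the particles are already sorted by cell so that one block handles one cell, that only a single scalar field component is gathered, and that the grid is stored x-fastest. The names gather_field, cell_origin and gather_demo are made up for illustration, and the "interpolation" is just a placeholder average over the 64 values. As far as I know, threads of a warp that read the same shared memory word in the same step are served by a broadcast rather than a bank conflict, which is the access pattern in the inner loop here.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int STENCIL = 4;                         // 4 points per dimension, as in the post
constexpr int TILE = STENCIL * STENCIL * STENCIL;  // 64 field values per cell

// Assumption: one block per cell, one thread per particle in that cell.
// cell_origin[b] is the lowest corner of the 4x4x4 stencil for block b.
__global__ void gather_field(const float* field, int nx, int ny, int nz,
                             const int3* cell_origin, float* out)
{
    __shared__ float tile[TILE];
    const int3 o = cell_origin[blockIdx.x];

    // Cooperative load: the block's threads stride over the 64 stencil points.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
        // Clamp to the grid; a real code would apply its own boundary condition.
        int x = min(max(o.x + i % STENCIL, 0), nx - 1);
        int y = min(max(o.y + (i / STENCIL) % STENCIL, 0), ny - 1);
        int z = min(max(o.z + i / (STENCIL * STENCIL), 0), nz - 1);
        tile[i] = field[(z * ny + y) * nx + x];
    }
    __syncthreads();

    // Every thread reads tile[i] with the same i in the same step.
    float acc = 0.0f;
    for (int i = 0; i < TILE; ++i)
        acc += tile[i];
    // Placeholder for the real interpolation weights: a plain average.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc / TILE;
}

// Hypothetical host helper: one cell, 32 particles, constant field of 2.0.
float gather_demo()
{
    const int nx = 8, ny = 8, nz = 8, ppc = 32;
    std::vector<float> h_field(nx * ny * nz, 2.0f);
    int3 h_origin = make_int3(1, 1, 1);

    float* d_field = nullptr;
    float* d_out = nullptr;
    int3* d_origin = nullptr;
    cudaMalloc(&d_field, h_field.size() * sizeof(float));
    cudaMalloc(&d_origin, sizeof(int3));
    cudaMalloc(&d_out, ppc * sizeof(float));
    cudaMemcpy(d_field, h_field.data(), h_field.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_origin, &h_origin, sizeof(int3), cudaMemcpyHostToDevice);

    gather_field<<<1, ppc>>>(d_field, nx, ny, nz, d_origin, d_out);
    assert(cudaGetLastError() == cudaSuccess);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_field);
    cudaFree(d_origin);
    cudaFree(d_out);
    return h_out;
}
```

Calling gather_demo() from a main() and printing the result is enough to try it. The global reads in the load loop are still scattered across rows of the grid, but they happen once per block instead of once per particle.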

Sorry, I know this is an old thread, but anyway. What you could do is use constant memory, which broadcasts the value when all threads in a warp (or maybe a half-warp, I can't remember) read from the same location. In that case, just make a reasonable kernel size and use shared mem to broadcast…
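
A rough sketch of what the constant memory variant might look like, under the assumption that the field values for the cells handled by one launch are copied into constant memory from the host first. The names c_tile, sum_from_constant and constant_demo are hypothetical. Keep in mind that constant memory is limited to 64 KB and cannot be written from inside a kernel, so this only fits if a launch works on a small number of cells.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int CTILE = 64;        // 4x4x4 field values for one cell

// Hypothetical: field values of the cell currently being processed.
__constant__ float c_tile[CTILE];

// One thread per particle; all particles here belong to the same cell.
__global__ void sum_from_constant(float* out, int n)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    float acc = 0.0f;
    // The index i is the same for every thread, so each read hits one
    // constant memory address and can be broadcast.
    for (int i = 0; i < CTILE; ++i)
        acc += c_tile[i];
    out[p] = acc;                // placeholder for the real interpolation
}

// Hypothetical host helper: fills the tile with 1.0 and returns out[0].
float constant_demo()
{
    const int n = 128;
    float h_tile[CTILE];
    for (int i = 0; i < CTILE; ++i)
        h_tile[i] = 1.0f;
    cudaMemcpyToSymbol(c_tile, h_tile, sizeof(h_tile));

    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    sum_from_constant<<<(n + 63) / 64, 64>>>(d_out, n);
    assert(cudaGetLastError() == cudaSuccess);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The host-side copy per batch of cells is the obvious cost of this approach, which is why loading the tile into shared memory inside the kernel may be the more practical way to get the same broadcast behaviour.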

Anyway, just a theory…