3000 floats per thread

I have a kernel that needs 3000-6000 floats per thread (i.e. 12-24 KB per thread). According to the CUDA programming guide, each thread on Kepler can use at most 512 KB of local memory, so it should work. However, I suspect that performance will be terrible and will depend largely on the L1 and L2 caches. Can anyone tell me how bad it would be? I have an Nvidia GTX Titan.

Local memory resides in the same place as global memory: it is essentially global memory with per-thread scope. So your performance will depend on how you access your data.
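To make the access-pattern point concrete, here is a rough host-side C++ sketch. It assumes the layout described in the CUDA programming guide, where consecutive 32-bit words of local memory are accessed by consecutive thread IDs, and it assumes a 128-byte cache line. The addressing formula and the names `local_byte_offset` and `lines_touched` are made up for illustration; this is a simplified model, not the hardware's exact addressing.

```cpp
#include <cstddef>
#include <set>

// Simplified model (an assumption, not the exact hardware mapping):
// element `index` of the per-thread array owned by warp lane `lane`
// sits at (index * kWarpSize + lane) * sizeof(float) inside the
// warp's block of local memory.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kLineBytes = 128;  // assumed cache-line size

std::size_t local_byte_offset(std::size_t index, std::size_t lane) {
    return (index * kWarpSize + lane) * sizeof(float);
}

// Counts the distinct cache lines one warp touches when lane t
// reads element index_of_lane(t) of its own per-thread array.
template <typename IndexFn>
std::size_t lines_touched(IndexFn index_of_lane) {
    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t offset = local_byte_offset(index_of_lane(lane), lane);
        lines.insert(offset / kLineBytes);
    }
    return lines.size();
}
```

Under this model, a warp in which every lane reads the same index (say `a[7]`) touches a single line, while a warp in which each lane reads a different index (say `a[lane]`) touches one line per distinct index. So data-dependent indexing into a large per-thread array is the case to worry about.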

It is worth noting that on Kepler, the L1 cache is used only for local memory accesses (global loads are cached in L2 only by default), so there are cases where an algorithm implemented with local memory could be faster than one using global memory.
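For a sense of scale, here is a back-of-the-envelope C++ sketch comparing the per-thread footprint from the question with the cache sizes. The hardware figures are assumptions taken from the published GTX Titan (GK110) specs: 14 SMXs, up to 48 KB of L1 per SMX in the largest split, 1536 KB of L2 and 2048 resident threads per SMX. Check them against your own device. The function names are made up for illustration.

```cpp
#include <cstddef>

// Assumed GTX Titan (GK110) figures; verify them for your device.
constexpr std::size_t kL1BytesPerSmx = 48 * 1024;    // largest L1 split
constexpr std::size_t kL2Bytes = 1536 * 1024;        // shared by all SMXs
constexpr std::size_t kMaxThreadsPerSmx = 2048;
constexpr std::size_t kSmxCount = 14;

// Bytes of local memory one thread needs for `floats` values.
constexpr std::size_t bytes_per_thread(std::size_t floats) {
    return floats * sizeof(float);
}

// How many threads' complete arrays fit in a cache of `cache_bytes`.
constexpr std::size_t threads_fitting(std::size_t cache_bytes, std::size_t floats) {
    return cache_bytes / bytes_per_thread(floats);
}

// Total local memory if every SMX ran its maximum number of threads.
constexpr std::size_t full_occupancy_footprint(std::size_t floats) {
    return bytes_per_thread(floats) * kMaxThreadsPerSmx * kSmxCount;
}
```

With these assumptions, the arrays of only 2 to 4 threads fit in one SMX's L1 and those of roughly 65 to 131 threads fit in L2, against up to 2048 resident threads per SMX. At 6000 floats per thread, a fully occupied device would need about 688 MB of local memory. The caches can therefore only help if each thread works on a small part of its array at a time.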

Access patterns will make a big difference, of course. The most important thing you can do for performance is try to organize the calculation to maintain data locality, and then decide to use local or global memory based on what helps you achieve that.