I have not seen the rationale explained officially, but I could speculate…
The shift in design from the SM of Fermi to the SMX of Kepler changed the ratio of CUDA cores to registers quite dramatically. The number of CUDA cores per multiprocessor went up by a factor of 6, while the number of registers only went up by a factor of 2. Assuming you have to scale the number of active threads on a multiprocessor roughly in proportion to the number of CUDA cores (not exactly true, but close enough), you end up with a lot more pressure on the register file on Kepler, and achieving good occupancy with register-heavy kernels will require spilling registers to local memory more often than before.
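The back-of-envelope version of that argument, using the per-multiprocessor core and register counts for a Fermi SM versus a Kepler SMX (and my simplifying assumption that active threads scale with core count):

```python
# Per-multiprocessor resources: Fermi SM vs. Kepler SMX
fermi_cores, kepler_cores = 32, 192              # 6x more CUDA cores
fermi_regs, kepler_regs = 32 * 1024, 64 * 1024   # only 2x more registers

# If active threads scaled with core count, the per-thread register
# budget on Kepler would shrink by the ratio of the two growth factors:
scale = (kepler_regs / fermi_regs) / (kepler_cores / fermi_cores)
print(scale)  # 1/3 as many registers per thread under that assumption
```

Under that (admittedly rough) assumption, each thread gets only a third of the registers it had on Fermi, which is where the extra spilling pressure comes from.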
On Fermi, these local memory spills went into the same L1 cache as global memory accesses, and I would not be surprised if there was some "cache thrashing" as these two different kinds of memory access fought over cache lines. One way to solve that problem is to split the two uses into different caches. The texture cache was already there, and using it for read-only arrays was already an old CUDA trick. It looks like compute capability 3.5 simply expanded it to a full 48 kB per multiprocessor (it used to be more like 6-8 kB) and exposed a way to use it directly for ordinary memory loads.
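The "way to use it directly" is, as far as I know, the `const __restrict__` qualifiers plus the `__ldg()` intrinsic that came with compute capability 3.5. A minimal sketch (kernel name and parameters are my own illustration):

```cuda
// Two ways to route global loads through the read-only (texture)
// data cache on compute capability 3.5+:
//  (1) mark the pointer const __restrict__, letting the compiler
//      emit LDG loads on its own, or
//  (2) force it explicitly with the __ldg() intrinsic.
__global__ void scale(float *out, const float * __restrict__ in,
                      float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * __ldg(&in[i]);  // read-only cached load
}
```

Either way, the read-only traffic stays out of the L1 that the spills are using, which is exactly the separation speculated about above.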
Note that I have no benchmarks or quantitative analysis demonstrating that this tradeoff in the use of L1 was a good idea for common kernels; I can only try to connect the dots based on NVIDIA's published choices.