Section F.5.3 of the Programming Guide says that shared memory can provide 32 banks * 8 bytes = 256 bytes per cycle per SM when using 64-bit mode. Is it possible to use this 64-bit mode for the L1 cache as well?
You can change the shared memory bank size with cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte), but I don’t think there is an equivalent setting for L1.
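For reference, here is a minimal host-side sketch of the shared memory call, with a read-back so you can see which bank size the device actually ended up with. It assumes a single device with the CUDA runtime available; the error handling and printed messages are just illustrative. As far as I know, the call does nothing on devices with a fixed bank size, so the read-back is the part worth checking.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Request 8-byte (64-bit) shared memory banks for the current device.
    cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    if (err != cudaSuccess) {
        std::printf("set failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the setting back; devices with a fixed bank size may not honor the request.
    cudaSharedMemConfig cfg;
    err = cudaDeviceGetSharedMemConfig(&cfg);
    if (err != cudaSuccess) {
        std::printf("get failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    if (cfg == cudaSharedMemBankSizeEightByte) {
        std::printf("shared memory bank size: 8 bytes\n");
    } else if (cfg == cudaSharedMemBankSizeFourByte) {
        std::printf("shared memory bank size: 4 bytes\n");
    } else {
        std::printf("shared memory bank size: device default\n");
    }
    return 0;
}
```

Note that this only affects shared memory banking; nothing here touches how L1 is accessed.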
Let me ask around.
Any results from asking?