Another possible trick:
Eliminate the divergence. Just do the write even when i equals ip. Prefetch the next element of src before you start oper on the current one.
__global__ void kernel(double *out, const double *src)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    double currentsrc = src[0];
    for (int ip = 0; ip < N; ip++) {
        // Prefetch the next element. L1 loads 128 bytes (16 doubles) at a
        // time, so this costs nothing for 15 out of 16 iterations.
        // If oper is long enough, the load that does occur is completely
        // hidden as well.
        // Do remember to allocate 8 extra bytes for src, so that reading
        // src[N] on the last iteration is not an out-of-bounds access.
        double nextsrc = src[ip + 1];
        out[i] += oper(i, ip) * currentsrc * constants;
        currentsrc = nextsrc;
    }
}
The above method generates fairly balanced global read accesses. To make them even more balanced you would have to create some divergence between warps, though I believe that is unnecessary in your case.
I think you could try putting src in constant memory. That way the compiler will certainly generate Load Uniform (LDU) instructions, which are supposedly optimised for exactly this kind of uniform access, where every thread reads the same element. Given that your kernel is probably compute-bound (oper is long), the memory part wouldn't matter much anyway.
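A minimal sketch of the constant-memory variant. N, oper and constants are placeholders standing in for whatever your real kernel uses (oper here is a dummy stub); note that constant memory is capped at 64 KB, so N can be at most 8192 doubles:

```cuda
#include <cuda_runtime.h>

#define N 1024  // assumed size; must fit in 64 KB of constant memory

__constant__ double src_const[N];

// Placeholder for the real per-element computation.
__device__ double oper(int i, int ip) { return (double)(i + ip); }

__global__ void kernel_const(double *out, double constants)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    double acc = out[i];
    for (int ip = 0; ip < N; ip++) {
        // Every thread in the warp reads the same address, which is the
        // broadcast pattern the constant cache / LDU path is built for.
        acc += oper(i, ip) * src_const[ip] * constants;
    }
    out[i] = acc;
}
```

On the host side you would fill it with `cudaMemcpyToSymbol(src_const, host_src, N * sizeof(double));` before launching.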
You will want to use shared memory for prefetching only when you want to prefetch multiple elements in one go and you do not have enough registers to hold them. Do check the cuobjdump output to make sure nextsrc doesn't get spilled to local memory. Of course, if your oper has sufficient ILP, you can reduce the number of threads per block and increase the number of registers per thread, though I doubt you will actually manage that given all the meddlesome optimisation by ptxas.
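For reference, a minimal sketch of that shared-memory batching: the block cooperatively stages a tile of src, then every thread sweeps it. TILE, n, oper and the kernel name are my own assumptions, not anything from your code (and TILE is assumed to equal blockDim.x):

```cuda
#include <cuda_runtime.h>

#define TILE 256  // elements staged per batch; assumed == blockDim.x

// Placeholder for the real per-element computation.
__device__ double oper(int i, int ip) { return (double)(i + ip); }

__global__ void kernel_tiled(double *out, const double *src, int n,
                             double constants)
{
    __shared__ double tile[TILE];
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    double acc = out[i];
    for (int base = 0; base < n; base += TILE) {
        // One coalesced load per thread fills the whole tile.
        if (base + threadIdx.x < n)
            tile[threadIdx.x] = src[base + threadIdx.x];
        __syncthreads();
        int limit = min(TILE, n - base);
        for (int t = 0; t < limit; t++)
            acc += oper(i, base + t) * tile[t] * constants;
        __syncthreads();  // don't refill the tile while others still read it
    }
    out[i] = acc;
}
```

The trade-off versus the register version is exactly the one above: you get 256 elements in flight at the cost of two barriers per tile, which only pays off when registers are the scarce resource.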