I'm still trying to understand why the optimal number of blocks is not a multiple of the number of SMs on the 285. Would someone mind running this on one of the higher-end NVIDIA cards and sending me the avg.csv output file?
Sorry that this and the question are posted both here and in the general computing section, but I finally figured out why I was observing such different behavior at different block sizes.
It turns out the performance was really determined by the alignment of each block's starting address. By forcing each block to start on a 64-byte boundary (16 4-byte floats), I now get much more predictable results. Here is the graph of the performance versus the number of blocks.
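For anyone who wants to try the same alignment fix, here is a rough host-side sketch of how the per-block starting offsets could be computed. This is not my actual benchmark code: `round_up` and `block_starts` are hypothetical helper names, and it assumes the base pointer itself is already 64-byte aligned (`cudaMalloc` allocations are) and that `num_blocks` is greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Number of floats in one 64-byte segment (64 / 4 = 16).
constexpr std::size_t kAlignFloats = 64 / sizeof(float);

// Round n up to the next multiple of kAlignFloats.
constexpr std::size_t round_up(std::size_t n) {
    return (n + kAlignFloats - 1) / kAlignFloats * kAlignFloats;
}

// Hypothetical helper: starting element index for each block when
// total_elems floats are split across num_blocks blocks (num_blocks > 0).
// Each block's chunk is padded up to a multiple of 16 floats, so every
// start index is a multiple of 16 and therefore 64-byte aligned,
// provided the base pointer is.
std::vector<std::size_t> block_starts(std::size_t total_elems,
                                      std::size_t num_blocks) {
    // Elements per block, rounded up so all elements are covered.
    std::size_t per_block = (total_elems + num_blocks - 1) / num_blocks;
    // Pad the per-block chunk to the alignment boundary.
    std::size_t stride = round_up(per_block);

    std::vector<std::size_t> starts(num_blocks);
    for (std::size_t i = 0; i < num_blocks; ++i) {
        starts[i] = i * stride;
    }
    return starts;
}
```

Inside the kernel, each block would then begin at `blockIdx.x * stride` and stop at the smaller of its own end and `total_elems`. Because of the padding, the last blocks may end up with less work, or none at all.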