Speedup Trend With Increasing Blocks... Trouble Interpreting Results

EDIT: I forgot to attach the picture. Also, when I reloaded the data into Excel, for some reason the peak no longer shifts between regular math and fast math…

But here is the picture now:

Hi all,

I am trying to interpret the speedups I obtained as a function of the number of blocks sent to the GPU.
The occupancy calculator says I can have 8 blocks per MP, with 64 threads per block.

I expected that increasing the number of blocks would steadily increase performance relative to the serial, single-threaded CPU code, whose run time scales linearly with the number of elements to calculate. (These are the same ‘elements’ that map to the threads in each block, so it's a one-for-one comparison.)

Instead I get the attached figure.
Please note that the fast-math option peaks at 768 blocks, while the other peaks at 384 blocks.

Does anyone have an idea (or suggestions for what I should look into further) why there is a peak at 384/768 blocks?

The fast math option results in no local memory being needed, while the non-fast-math one uses 168 bytes of lmem, if that matters.
The data points with < 384 blocks were run with numbers of elements such that the last block is only partially filled, and since my kernel has something like "if (threadIdx.x + blockDim.x*blockIdx.x < total_number_elements)", there will be some divergence added. At first I thought this was partially the culprit, until I saw the fast math result.

Both codes are compiled with -arch sm_13.
I am running on a GTX 275 (30 MPs).

Thanks very much,