Benchmarking inconsistencies with multiply and subtract

I’m trying to do some benchmarking on basic filters and ran into some strange timings.
The code runs on a GTX 285 with 1800 x 1000 single-precision floating-point buffers (I didn’t bother initializing the values, and don’t currently see why it should matter).

Calculating “a = b - c” I get 1.2 ms per frame. Calculating “a = b * b” gives me 0.65 ms per frame for about half the runs and 0.18 ms per frame for the other half. The higher time shows up more consistently when running from inside Visual Studio with F5, especially if there is a breakpoint anywhere in the code (it doesn’t seem to matter where); the lower time shows up more consistently when running from the console.

The lower value seems more or less consistent with the memory bandwidth; the higher value is more in line with the subtraction result.

I tried running this in the profiler, and it looks like all transfers are gld64b and gst64b. There is no local memory use and there are 3 TLB misses (I have no idea where constant and shared memory would play a role with kernels this simple).
The time reported in the profiler is 163.7 for the multiply and 176.3 for the subtraction.

Any ideas what is causing these timing inconsistencies?