L2 transfer overhead (profiler bug?)

Since I have already written this up on Stack Overflow, allow me to link to the question: "Cuda L2 transfer overhead" (Stack Overflow).

The awesome Scott Gray (@scottgray76) pointed out to me on Twitter that this might be a bug in the profiler and that those numbers could be bogus.

Does anybody see this differently? Am I missing something here?

Thank you