A GT 240 (sm_12, 12 SMs) reports a similar global load/store efficiency number (24%).
Fermi and Kepler devices report 100%.
Example code here.
Update: I dug a little deeper into the global ld/st efficiency numbers for sm_12 devices and was just as confounded as you. If you dig deeper into the Visual Profiler and collect Metrics & Events you can capture gld/gst 128/64/32b events as well as total requests and coalesced transaction counts. None of these metrics point to low efficiency.
Update 2: I am pretty sure that for sm_12 you should be interpreting the gld/gst_efficiency as a strict device-specific ratio and not as a percentage. The target number you should strive for is “2 * #SM”. For a GT 240 it is 24 and for the 1800m it is 18 (9 SMs). Anything less implies gld/gst requests were “fragmented”. I assume sm_13 mirrors sm_12.
How did I come to this conclusion? Force some uncoalesced loads or stores in your microbenchmark and inspect the 128/64/32b event counters. Then plug them into the documented formula. They match the reported ‘efficiency’ ratio.
So is this actually a bug in Visual Profiler? Yeah, I think so. The formula in the docs mirrors what VP is reporting, but if it’s supposed to be a percentage then the “(2 * #SM)” factor should simply be removed.
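If you want a real percentage in the meantime, here is a minimal sketch of the rescaling, assuming my reading is right that the reported sm_12 value is scaled by 2 * #SM. The function name and the hard-coded SM counts are mine, purely for illustration; nothing here comes from the profiler itself.

```python
def sm12_efficiency_percent(reported: float, num_sms: int) -> float:
    """Rescale a reported sm_12 gld/gst efficiency value to a percentage.

    Assumption (not confirmed by NVIDIA): the reported value tops out at
    2 * num_sms, so dividing by that factor recovers a 0..100 range.
    """
    if num_sms <= 0:
        raise ValueError("num_sms must be positive")
    return 100.0 * reported / (2 * num_sms)


# GT 240: 12 SMs, reported value 24
print(sm12_efficiency_percent(24, 12))  # 100.0
# 1800m: 9 SMs, reported value 18
print(sm12_efficiency_percent(18, 9))   # 100.0
```

Anything that comes out below 100 would then correspond to the “fragmented” requests mentioned above.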
For sm_11 the efficiency number’s range appears to be between 0.0 and 1.0. Not a percentage either – so also a minor bug. 1 means all transactions were coalesced, otherwise it’s the ratio of coalesced/(coalesced+uncoalesced). This matches the formula in the VP documentation and is at least close to being a percentage. Note that Visual Profiler doesn’t reveal any transaction size counters for sm_11 devices.
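As a rough sketch of the sm_11 ratio described above, assuming you have the coalesced and uncoalesced transaction counts in hand (the function and parameter names are hypothetical, not profiler counter names):

```python
def sm11_efficiency(coalesced: int, uncoalesced: int) -> float:
    """Ratio of coalesced transactions to all transactions, in 0.0..1.0."""
    total = coalesced + uncoalesced
    if total == 0:
        raise ValueError("no transactions recorded")
    return coalesced / total


print(sm11_efficiency(100, 0))   # 1.0, everything coalesced
print(sm11_efficiency(75, 25))   # 0.75
```

Multiply by 100 if you want it as a percentage.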
I’ll now let the old headless GT 240 and 9400 GT cards go back to sleep. :)