rtBuffer<float4, 3> vs rtBuffer<float, 3> access performance

I’ve been reading up on memory coalescing and was curious whether accessing data through an rtBuffer<float4, 3> instead of an rtBuffer<float, 3> makes any noticeable performance difference. At first glance I would expect a single index into the float4 buffer to fetch all 4 values at once, which should require fewer memory requests. But is the compiler smart enough to fetch as many consecutive memory addresses as possible in the float case? I know that testing it (which I’m about to do) gives the concrete answer, but I’d also like some insight at the conceptual level.

// float version: one 4-byte read-modify-write per iteration, stepping along z
uint3 index2 = make_uint3( launch_index.x, launch_index.y, 0 );
for (size_t i = 0; i < numfloats; ++i) {
   float_buffer[index2] += 1.f;
   ++index2.z;
}

vs.

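// float4 version: one 16-byte read-modify-write per iteration, stepping along z
// (touches the same amount of data as the float loop when numfloats == 4 * float4Groups)
uint3 index = make_uint3( launch_index.x, launch_index.y, 0 );
for (size_t i = 0; i < float4Groups; ++i) {
   float4_buffer[index] += make_float4(1.f, 1.f, 1.f, 1.f);
   ++index.z;
}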
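
To get a feel for this at the conceptual level, here is a rough host-side C++ sketch (not OptiX code) that counts how many memory sectors one warp would touch in each of the two loops. Everything in it is an assumption on my part rather than something OptiX guarantees: that the buffer is laid out linearly with x varying fastest, that the 32 threads of a warp have consecutive launch_index.x values and the same launch_index.y, that memory is fetched in 32-byte sectors, and that the buffer starts on a sector boundary. The names LoopCost, byteOffset and modelLoop are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters for this model (not OptiX guarantees).
constexpr std::size_t kWarpSize    = 32;  // threads that issue one memory instruction together
constexpr std::size_t kSectorBytes = 32;  // assumed size of one global-memory sector

// Result of modelling one of the two loops for a single warp.
struct LoopCost {
    std::size_t accessesPerThread;  // loop iterations, i.e. load/store pairs each thread issues
    std::size_t sectorsPerWarp;     // sectors the whole warp touches, summed over the loop
};

// Byte offset of element (x, y, z), assuming a linear layout with x varying fastest.
std::size_t byteOffset(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t width, std::size_t height, std::size_t elemBytes) {
    return (x + width * (y + height * z)) * elemBytes;
}

// Models `buffer[index] += ...; ++index.z;` repeated `depth` times for a warp whose
// 32 threads are assumed to have launch_index.x = 0..31 and launch_index.y = 0.
LoopCost modelLoop(std::size_t width, std::size_t height,
                   std::size_t depth, std::size_t elemBytes) {
    assert(width >= kWarpSize && elemBytes > 0);
    LoopCost cost{depth, 0};
    for (std::size_t z = 0; z < depth; ++z) {
        std::set<std::size_t> sectors;  // distinct sectors touched in this iteration
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            const std::size_t first = byteOffset(lane, 0, z, width, height, elemBytes);
            const std::size_t last  = first + elemBytes - 1;
            for (std::size_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
                sectors.insert(s);
            }
        }
        cost.sectorsPerWarp += sectors.size();
    }
    return cost;
}
```

Calling modelLoop(64, 64, 16, sizeof(float)) and modelLoop(64, 64, 4, 4 * sizeof(float)) models 16 floats per thread against 4 float4 groups on a 64x64 slice. Under the assumptions above both calls report 64 sectors per warp, but 16 versus 4 accesses per thread. So in this model both loops are already coalesced across the warp, and the float4 version simply moves the same data with a quarter of the memory instructions. It also hints at why the compiler cannot merge the float accesses on its own: with this layout, consecutive z values for one thread are a whole width*height slice apart in memory, not adjacent.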

-edit-
So on my hardware the float loop takes about 3210 clock cycles and the float4 loop about 1978. That’s about 38% fewer clock cycles (1 - 1978/3210 ≈ 0.38), or roughly a 1.6x speedup. I guess that is a pretty good performance boost.