I received a T10P board in the mail today and have been trying out the native double support in one of our kernels. While looking at profiler output, I noticed that the shared memory broadcast mechanism only seems to work for 32-bit types. When I switched one of our shared arrays from float to double, the warp_serialized counter went from zero to non-zero.
(The programming guide section on shared memory clearly states that broadcast applies to 32-bit words, so this is documented behavior even though it surprised me.)
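For anyone who wants to try this, here is a minimal sketch of the kind of test I mean. It is not our real kernel: `broadcastRead`, `tile`, `run`, and the sizes are made-up names and values for illustration. Only thread 0 fills the shared array, so the only shared memory access made by a whole half-warp is the same-address read. I am assuming the build uses something like `nvcc -arch sm_13`, since otherwise doubles get demoted to float on this generation of hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical repro kernel: every thread reads the SAME shared element,
// so the read depends on the shared memory broadcast mechanism.
template <typename T>
__global__ void broadcastRead(const T *in, T *out, int idx)
{
    __shared__ T tile[64];

    // Only thread 0 fills the array, so the fill itself cannot cause
    // bank conflicts and does not muddy the profiler counter.
    if (threadIdx.x == 0) {
        for (int i = 0; i < 64; ++i)
            tile[i] = in[i];
    }
    __syncthreads();

    // All threads read tile[idx]. idx is a kernel argument so the
    // compiler cannot fold the read away at compile time.
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[idx] + T(threadIdx.x);
}

// Launch one block of 64 threads for the given element type.
template <typename T>
void run(const char *label)
{
    const int n = 64;
    T hostIn[n], hostOut[n];
    for (int i = 0; i < n; ++i)
        hostIn[i] = T(i);

    T *devIn = 0, *devOut = 0;
    cudaMalloc((void **)&devIn, n * sizeof(T));
    cudaMalloc((void **)&devOut, n * sizeof(T));
    cudaMemcpy(devIn, hostIn, n * sizeof(T), cudaMemcpyHostToDevice);

    broadcastRead<T><<<1, n>>>(devIn, devOut, 5);

    // The blocking copy back also waits for the kernel to finish.
    cudaMemcpy(hostOut, devOut, n * sizeof(T), cudaMemcpyDeviceToHost);
    printf("%s: out[0] = %f\n", label, (double)hostOut[0]);

    cudaFree(devIn);
    cudaFree(devOut);
}

int main()
{
    run<float>("float");    // 32-bit elements
    run<double>("double");  // 64-bit elements
    return 0;
}
```

Running this under the profiler and comparing the warp_serialized counter for the two launches should show whether the float and double cases differ the way they did in our kernel.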
Is broadcasting a double to an entire half-warp equivalent to a 16-way bank conflict?