With the newer 2.x architectures this feature is (I think) less important, but the best tutorial is still the CUDA Programming Guide.
Section G.3.2.2 of the v3.2 guide is where I learned the coalescing concept and how it influences performance.
The same guide has a few examples on page 164. They show not only the number of transactions per half-warp but also the size of each transaction.
Finally, another good tutorial is the vectorAdd example. Try changing the stride it uses to access the vector elements so that the accesses become uncoalesced, and you will see how much uncoalesced accesses affect performance. For example, sum the elements of the vector with a stride of 3.
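To get a feel for what a stride does before timing anything, here is a minimal host-side sketch in plain C++ (no GPU needed) that counts how many aligned memory segments one half-warp touches for a given stride. It is a simplified model, not the guide's exact rules: I am assuming 16 threads per half-warp, 4-byte words, 128-byte segments, and an array that starts on a segment boundary. The function name and constants are my own, and the cached accesses on 2.x devices are not modelled.

```cpp
#include <cstddef>
#include <set>

// Assumed parameters for this simplified model (my own choices).
const std::size_t kHalfWarp = 16;       // threads per half-warp
const std::size_t kWordBytes = 4;       // size of one float
const std::size_t kSegmentBytes = 128;  // assumed segment size for 4-byte words

// Counts the distinct aligned segments touched when thread t of a half-warp
// reads word index t * stride from an array starting on a segment boundary.
std::size_t segments_touched(std::size_t stride) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kHalfWarp; ++t) {
        std::size_t byte_address = t * stride * kWordBytes;
        segments.insert(byte_address / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(1)` gives 1 and `segments_touched(3)` gives 2, so a stride of 3 already spreads one half-warp's reads over two segments. With a stride of 32 every thread lands in its own segment and the count is 16.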
For all of your experiments, you will probably need a sheet of paper and a pen.