Cycle counts for each PTX instruction

Hi all,

Is there any document that gives the number of cycles it takes to execute a PTX instruction on specific hardware?
I am especially interested in memory instructions.
Or is there a benchmark that measures this? I want to know how many cycles it takes to serve one memory transaction,
and the departure delay between two consecutive memory transactions.
Is the difference between coalesced and uncoalesced accesses that a coalesced access issues only one transaction, while
an uncoalesced access issues 16 transactions, so that its total time is one transaction plus 15 departure delays?
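
To make the model I have in mind concrete, here is a small Python sketch of the arithmetic. The function name and the two numbers (400 cycles per transaction, 4 cycles of departure delay) are placeholders I made up for illustration, not measured values for any hardware; they are exactly the quantities I am trying to find.

```python
def estimated_cycles(num_transactions, latency, departure_delay):
    """Cycles to serve back-to-back memory transactions, assuming each one
    departs `departure_delay` cycles after the previous one and takes
    `latency` cycles to complete (the pipelined model from the question)."""
    if num_transactions < 1:
        raise ValueError("need at least one transaction")
    return latency + (num_transactions - 1) * departure_delay


# Placeholder values, NOT measured on real hardware.
LATENCY = 400          # assumed cycles to serve one memory transaction
DEPARTURE_DELAY = 4    # assumed cycles between consecutive transactions

coalesced = estimated_cycles(1, LATENCY, DEPARTURE_DELAY)     # one transaction
uncoalesced = estimated_cycles(16, LATENCY, DEPARTURE_DELAY)  # 16 transactions

print(coalesced, uncoalesced)  # prints "400 460"
```

If this model is right, the cost of an uncoalesced access depends mostly on the departure delay rather than on the latency of a single transaction, which is why I would like to know both numbers.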

Thanks,