Cuda SMs comprise 4 Partitions. Those have independent arithmetic units=cores.
Warps running on a SM are each assigned to a specific partition.
Each partition can schedule an instruction every clock cycle.
Shared memory is accessed over the LSU. It is a single resource shared by all 4 partitions.
Typical speeds from the last few architecture generations are mostly 1 (up to 2 cycles) per access / SM. There is no time-sharing (not multiple ports for each SM partition).