For architecture sm_60, and I suspect also for sm_53 (TX1) and sm_62 (TX2), AFAIK the FP16 throughput is implemented via a special mode of the FP32 cores. Therefore, in a given instruction cycle, a given core can be scheduled with either an FP32 instruction or an FP16 instruction (processing a half2), but not both.
Support for the sm_60 claim is here:
At least on sm_60, according to this claim, the FP32 and FP16 cores are the same functional unit within the SM. I’m not sure you’ll find much actual NVIDIA documentation or specification to this effect. It could possibly be confirmed via microbenchmarking, although that would probably be pretty difficult to do.
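For anyone who wants to try the microbenchmarking route, here is a rough sketch of how I might start. It is not a validated benchmark: the kernel, the helper names (`fma_kernel`, `time_ms`, `predicted_shared_ms`, `predicted_separate_ms`), the launch configuration and the iteration count are all my own assumptions. The idea is to time an FP32-only FMA loop, a half2-only FMA loop, and a loop doing both. If the mixed time is close to the sum of the two, that is consistent with a shared functional unit. If it is close to the larger of the two, that suggests separate units. Build with something like `nvcc -O3 -arch=sm_60`.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

constexpr int kIters = 1 << 16;  // assumed loop length, tune as needed

// Mixed-run time expected if FP32 and FP16 share one unit: the times add.
float predicted_shared_ms(float t32, float t16) { return t32 + t16; }
// Mixed-run time expected if the units are separate and can overlap fully.
float predicted_separate_ms(float t32, float t16) { return std::max(t32, t16); }

// Four independent FMA chains per data type to expose some ILP.
// a and b are runtime arguments so the compiler cannot fold the loop away.
template <bool UseF32, bool UseF16>
__global__ void fma_kernel(float *out, float a, float b) {
    float f0 = threadIdx.x, f1 = f0 + 1.0f, f2 = f0 + 2.0f, f3 = f0 + 3.0f;
    __half2 ha = __float2half2_rn(a), hb = __float2half2_rn(b);
    __half2 h0 = __float2half2_rn(f0), h1 = __float2half2_rn(f1);
    __half2 h2 = __float2half2_rn(f2), h3 = __float2half2_rn(f3);
#pragma unroll 16
    for (int i = 0; i < kIters; ++i) {
        if (UseF32) {
            f0 = fmaf(f0, a, b); f1 = fmaf(f1, a, b);
            f2 = fmaf(f2, a, b); f3 = fmaf(f3, a, b);
        }
        if (UseF16) {
            h0 = __hfma2(h0, ha, hb); h1 = __hfma2(h1, ha, hb);
            h2 = __hfma2(h2, ha, hb); h3 = __hfma2(h3, ha, hb);
        }
    }
    // Store a value that depends on every chain so nothing is dead code.
    float sum = 0.0f;
    if (UseF32) sum += f0 + f1 + f2 + f3;
    if (UseF16) {
        __half2 h = __hadd2(__hadd2(h0, h1), __hadd2(h2, h3));
        sum += __low2float(h) + __high2float(h);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Time one launch with CUDA events, after one untimed warm-up launch.
template <bool UseF32, bool UseF16>
float time_ms(float *d_out, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    fma_kernel<UseF32, UseF16><<<blocks, threads>>>(d_out, 0.999f, 0.001f);
    cudaEventRecord(start);
    fma_kernel<UseF32, UseF16><<<blocks, threads>>>(d_out, 0.999f, 0.001f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    const int threads = 256;
    const int blocks = prop.multiProcessorCount * 8;  // assumed: enough to fill the GPU
    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float) * blocks * threads);

    float t32 = time_ms<true, false>(d_out, blocks, threads);
    float t16 = time_ms<false, true>(d_out, blocks, threads);
    float tmix = time_ms<true, true>(d_out, blocks, threads);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("FP32 only: %.3f ms, half2 only: %.3f ms, mixed: %.3f ms\n",
                t32, t16, tmix);
    std::printf("shared-unit prediction: %.3f ms, separate-unit prediction: %.3f ms\n",
                predicted_shared_ms(t32, t16), predicted_separate_ms(t32, t16));
    cudaFree(d_out);
    return 0;
}
```

Treat the result as suggestive at best. Instruction issue rate, register pressure and clock behavior can also make the two times add up even if the units were separate, which is part of why I expect this to be difficult to pin down. Inspecting the SASS (for example with `cuobjdump -sass`) to check that the loop really contains the expected FFMA and HFMA2 instructions would be a sensible first sanity check.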
Architectural arithmetic throughputs can be determined from this table:
but they are not all guaranteed to be achievable at the same time.