Can 16-bit and 32-bit native arithmetic instructions run independently?

In the CUDA programming guide:
Table 2. Throughput of Native Arithmetic Instructions. (Number of Results per Clock Cycle per Multiprocessor)
Compute Capability                                        | 3.0, 3.2 | 3.5, 3.7 | 5.0, 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.x
16-bit floating-point add, multiply, multiply-add         | N/A      | N/A      | N/A      | 256 | 128 | 2   | 256 | 128
32-bit floating-point add, multiply, multiply-add         | 192      | 192      | 128      | 128 | 64  | 128 | 128 | 64
64-bit floating-point add, multiply, multiply-add         | 8        | 64       | 4        | 4   | 32  | 4   | 4   | 32


Question 1: Can 16-bit floating-point add, multiply, and multiply-add together produce 128 results per clock in total (x+y+z=128), or can each produce 128 separately (x=y=z=128)?

Question 2: Do 16-bit floating-point instructions use the same hardware as 32-bit floating-point instructions, so that the two kinds of instruction can conflict?

Thanks in advance.

For question 1: you can reach the 128 figure with a mix of instructions (x+y+z=128). You do not get a throughput of 128 instructions per SM per clock for each of add, multiply, and multiply-add simultaneously.
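To make the shared budget concrete, here is a minimal Python sketch. It assumes a simplified model in which the table entry is the only limit (no memory, latency, or issue stalls), and `min_cycles` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import math


def min_cycles(adds, muls, fmas, results_per_clock=128):
    """Lower bound on clock cycles for one SM under the simplified model.

    Add, multiply, and multiply-add draw from one shared budget of
    `results_per_clock`, so only their total count matters.
    """
    total = adds + muls + fmas
    return math.ceil(total / results_per_clock)


# 128 results in any mix fit in one cycle ...
print(min_cycles(128, 0, 0))      # 1
print(min_cycles(40, 40, 48))     # 1
# ... but 128 of each type needs three cycles, not one.
print(min_cycles(128, 128, 128))  # 3
```

The default of 128 is the table value discussed in the question; pass a different `results_per_clock` for another compute capability or data type.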

For question 2, I believe the answer varies by GPU type.

My understanding is that cc 6.0 may use the same hardware, but that the other compute capabilities have independent hardware for 16-bit and 32-bit.

As far as I know, this information (question 2) isn’t published, so YMMV, and you would need to confirm it as best you can via microbenchmarking or by reviewing published microbenchmarks.

For example, this microbenchmarking paper:

https://arxiv.org/pdf/1804.06826.pdf

claims that on Pascal, instructions like HFMA2 and FFMA have the same latency, whereas on Volta the latency of the two differs. This is a clue that on Pascal they may share the same unit, whereas on Volta they appear not to use (exactly) the same hardware.
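If you run your own microbenchmark, one possible approach is to time three kernels: one issuing only FP32 FMAs, one issuing only FP16 FMAs, and one issuing both. The Python sketch below shows one way to compare the three timings. The function name and the example timings are made up for illustration and are not measurements from any GPU. Issue-rate limits and other bottlenecks also affect real timings, so treat the result as a hint rather than proof.

```python
def overlap_fraction(t_fp32, t_fp16, t_mixed):
    """Estimate how much the FP32 and FP16 instruction streams overlapped.

    Returns 1.0 when the mixed kernel takes only as long as the slower
    stream (consistent with independent units) and 0.0 when it takes the
    sum of both (consistent with a shared unit). Values are clamped to
    the range [0.0, 1.0].
    """
    if t_fp32 <= 0 or t_fp16 <= 0:
        raise ValueError("both single-stream timings must be positive")
    serial = t_fp32 + t_fp16        # no overlap at all
    parallel = max(t_fp32, t_fp16)  # perfect overlap
    frac = (serial - t_mixed) / (serial - parallel)
    return min(1.0, max(0.0, frac))


# Hypothetical timings in milliseconds, chosen only to show the two extremes.
print(overlap_fraction(10.0, 10.0, 20.0))  # 0.0
print(overlap_fraction(10.0, 10.0, 10.0))  # 1.0
```

A value near 0.0 would fit the shared-hardware reading, and a value near 1.0 would fit the independent-hardware reading; anything in between needs a closer look at the kernel and the generated instructions.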