Looking at this excerpt from the Ampere whitepaper (note: this covers GA10X, i.e. compute capability 8.6):
“2x FP32 Throughput
In the Turing generation, each of the four SM processing blocks (also called partitions) had two primary datapaths, but only one of the two could process FP32 operations. The other datapath was limited to integer operations. GA10X includes FP32 processing on both datapaths, doubling the peak processing rate for FP32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores, and is capable of executing either 16 FP32 operations OR 16 INT32 operations per clock. As a result of this new design, each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.”
My take is that both Volta/Turing and Ampere 8.6 can issue a peak of 128 ops/clk per SM. However, if your workload is primarily FP32, Ampere can potentially issue more ops/clk, because each partition has FP32 on both datapaths, whereas Volta/Turing has FP32 on only one of the two.
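To put rough numbers on that, here is a small lane-counting sketch in Python. It uses only the figures from the quote (4 partitions per SM, 16 lanes per datapath) and assumes an idealized mix of FP32 and INT32 instructions with no scheduling, dependency, or memory limits, so treat the results as upper bounds and not as measured throughput. The function name and parameters are mine, not from any NVIDIA tool.

```python
# Idealized per-SM peak ops/clk, counting lanes only.
# Assumption: figures come from the quoted whitepaper text; real kernels
# are also limited by issue rate, dependencies, and memory.

PARTITIONS_PER_SM = 4
LANES_PER_DATAPATH = 16


def peak_ops_per_clock(fp32_fraction: float, second_path_fp32: bool) -> float:
    """Peak FP32+INT32 ops/clk per SM for a given instruction mix.

    fp32_fraction    -- share of ops that are FP32 (the rest are INT32)
    second_path_fp32 -- True for the GA10x-style design (second datapath
                        does FP32 or INT32), False for the Turing-style
                        design (second datapath is INT32 only)
    """
    if not 0.0 <= fp32_fraction <= 1.0:
        raise ValueError("fp32_fraction must be between 0 and 1")

    lanes = PARTITIONS_PER_SM * LANES_PER_DATAPATH  # 64 lanes per datapath
    int_fraction = 1.0 - fp32_fraction

    # INT32 work can only run on the second datapath in both designs.
    int_limit = lanes / int_fraction if int_fraction > 0 else float("inf")

    if second_path_fp32:
        # FP32 can use either datapath, so only the total lane count
        # and the INT32 limit constrain throughput.
        return min(2 * lanes, int_limit)

    # FP32 is confined to the first datapath.
    fp_limit = lanes / fp32_fraction if fp32_fraction > 0 else float("inf")
    return min(fp_limit, int_limit)


if __name__ == "__main__":
    for f in (0.5, 0.75, 1.0):
        turing = peak_ops_per_clock(f, second_path_fp32=False)
        ga10x = peak_ops_per_clock(f, second_path_fp32=True)
        print(f"FP32 share {f:.2f}: Turing-style {turing:.1f}, GA10x-style {ga10x:.1f}")
```

Under this model, a 50/50 FP32/INT32 mix gives 128 ops/clk on both designs, while a pure FP32 mix gives 64 on the Turing-style layout and 128 on the GA10x-style layout, which matches the "double the FP32 rate" statement in the quote.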