Do you have an example of that? The only way it would make sense is if there were:
- multiple warp schedulers per SMSP
<or>
- the warp scheduler can issue more than 32 threads/clk
AFAIK there is no such GPU that has an SM subdivision into two or more SMSPs with each SMSP having 64 CUDA FP32 cores, and also has either a warp scheduler with more than 32 threads/clk issue rate, or multiple warp schedulers per SMSP. So I have no answer. There is no such animal.
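
To make the arithmetic behind that concrete, here is a rough sketch in Python. It uses a deliberately simplified model that I am assuming for illustration: each thread-instruction issued occupies one FP32 lane for one clock, and nothing else (dual issue, other functional units, latency) is modeled. The function names and the 64-core SMSP are hypothetical, not a description of any real GPU.

```python
WARP_SIZE = 32  # threads per warp


def threads_issued_per_clk(schedulers_per_smsp=1, threads_per_scheduler_clk=WARP_SIZE):
    """Peak thread-instructions one SMSP can issue per clock (simplified model)."""
    return schedulers_per_smsp * threads_per_scheduler_clk


def can_saturate(fp32_cores_per_smsp, schedulers_per_smsp=1,
                 threads_per_scheduler_clk=WARP_SIZE):
    """True if the issue rate is high enough to keep every FP32 core busy each clock.

    Assumption (hypothetical model): one issued thread-instruction feeds
    exactly one FP32 core for one clock.
    """
    issued = threads_issued_per_clk(schedulers_per_smsp, threads_per_scheduler_clk)
    return issued >= fp32_cores_per_smsp


# Hypothetical SMSP with 64 FP32 cores and one scheduler issuing 32 threads/clk:
print(can_saturate(64))                                # False
# The two escape hatches from the list above:
print(can_saturate(64, schedulers_per_smsp=2))         # True
print(can_saturate(64, threads_per_scheduler_clk=64))  # True
```

Under this model, a 64-core SMSP with a single 32 threads/clk scheduler could only ever feed half of its cores, which is why such a design would need one of the two conditions listed.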