About the number of CUDA cores in an SMSP: fewer or greater than the number of threads in a warp (32)

Yes

Do you have an example of that? The only way it would make sense is if there were:

  • multiple warp schedulers per SMSP

    <or>
    
  • a warp scheduler that can issue more than 32 threads/clk

AFAIK no GPU exists whose SM is subdivided into two or more SMSPs, each with 64 CUDA FP32 cores, and which also has either a warp scheduler that issues more than 32 threads/clk or multiple warp schedulers per SMSP. So I have no example to give. There is no such animal.
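
To make the arithmetic behind that argument concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a description of any real GPU: the function name `fp32_issue_model` and its parameters are made up for illustration, and it assumes the only limits are the number of FP32 cores in the SMSP and the combined issue rate of its warp scheduler(s).

```python
import math

WARP_SIZE = 32  # threads per warp


def fp32_issue_model(cores_per_smsp, schedulers_per_smsp=1,
                     threads_per_clk_per_scheduler=WARP_SIZE):
    """Toy model (hypothetical, for illustration only).

    Returns how many clocks one warp-wide FP32 instruction occupies the
    cores, how many cores can be kept busy per clock, and the resulting
    fraction of cores in use.
    """
    # Total threads the scheduler(s) can issue per clock in this SMSP.
    issue_rate = schedulers_per_smsp * threads_per_clk_per_scheduler
    # With fewer cores than warp threads, a warp is spread over several clocks.
    cycles_per_warp = math.ceil(WARP_SIZE / min(cores_per_smsp, WARP_SIZE))
    # Cores beyond the issue rate can never receive work.
    busy_cores = min(cores_per_smsp, issue_rate)
    return {
        "cycles_per_warp": cycles_per_warp,
        "busy_cores": busy_cores,
        "utilization": busy_cores / cores_per_smsp,
    }


if __name__ == "__main__":
    for cores, scheds in [(16, 1), (32, 1), (64, 1), (64, 2)]:
        print(cores, scheds, fp32_issue_model(cores, scheds))
```

Under these assumptions, fewer cores than warp threads is fine (the warp just takes more clocks), but 64 cores behind a single 32 threads/clk scheduler would leave half of them idle. Only a second scheduler or a wider issue rate would fill them, which are the two cases listed above.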