Launching a kernel once is slower than launching it several times with a smaller grid

Launching the kernel once with blocks=20498, threads=64, regsperthread=255 takes 100 ms, with a[20498].
Launching the kernel 16 times with blocks=1282, threads=64, regsperthread=255 takes 2 ms, with a[16][1282],
which is three times faster than launching it once.
The device is an RTX 2080.
Does anyone know the reason?
I found that it is related to sms, warpsize and regspersm. Does anyone know how this works?
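
For reference, here is a rough host-side sketch of how the SM count, warp size and registers per SM might combine to limit how many blocks run at the same time. The numbers in kTuringAssumed are assumptions for a Turing-class card like the RTX 2080 (46 SMs, 65536 registers per SM, 1024 threads per SM, 16 blocks per SM, registers allocated per warp in chunks of 256), and the names SmLimits, resident_blocks_per_sm and waves are made up for this illustration. It is a simplified model, not what the driver actually does.

```cpp
#include <cassert>

// Assumed per-SM limits for a Turing-class GPU (compute capability 7.5).
// These values are assumptions for illustration, not queried from the device.
struct SmLimits {
    int num_sms;             // number of SMs on the device
    int regs_per_sm;         // 32-bit registers per SM
    int max_threads_per_sm;  // resident thread limit per SM
    int max_blocks_per_sm;   // resident block limit per SM
    int warp_size;           // threads per warp
    int reg_alloc_unit;      // per-warp register allocation granularity
};

const SmLimits kTuringAssumed{46, 65536, 1024, 16, 32, 256};

int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Blocks that can be resident on one SM, limited by registers,
// threads, and the hard block limit (shared memory is ignored here).
int resident_blocks_per_sm(const SmLimits& sm, int threads_per_block,
                           int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int warps_per_block = ceil_div(threads_per_block, sm.warp_size);
    // Round the per-warp register need up to the allocation unit.
    int regs_per_warp =
        ceil_div(regs_per_thread * sm.warp_size, sm.reg_alloc_unit) *
        sm.reg_alloc_unit;
    int blocks_by_regs = (sm.regs_per_sm / regs_per_warp) / warps_per_block;
    int blocks_by_threads =
        sm.max_threads_per_sm / (warps_per_block * sm.warp_size);
    int blocks = blocks_by_regs;
    if (blocks_by_threads < blocks) blocks = blocks_by_threads;
    if (sm.max_blocks_per_sm < blocks) blocks = sm.max_blocks_per_sm;
    return blocks;
}

// Number of "waves" needed to run the whole grid when every SM is
// filled with as many resident blocks as the model allows.
int waves(const SmLimits& sm, int grid_blocks, int threads_per_block,
          int regs_per_thread) {
    int per_sm = resident_blocks_per_sm(sm, threads_per_block, regs_per_thread);
    assert(per_sm > 0);
    return ceil_div(grid_blocks, per_sm * sm.num_sms);
}
```

Under these assumptions, resident_blocks_per_sm(kTuringAssumed, 64, 255) gives 4 blocks per SM (184 across the device), so waves(kTuringAssumed, 20498, 64, 255) is 112 and waves(kTuringAssumed, 1282, 64, 255) is 7, which is also 112 waves in total over 16 launches. So this simple model by itself would not explain the gap in timing. On the real device, cudaOccupancyMaxActiveBlocksPerMultiprocessor from the CUDA runtime reports the actual resident block count for a given kernel.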