When I profile my CUDA program with Nsight Systems, I always see ampere_sgemm_128x128_nn in the nsys window. I'm confused about how my kernel is executed at the CUDA level. Is it decomposed into several kernels such as ampere_sgemm_128x128_nn? Also, where can I find references for these kernels?

It’s probably coming from a cuBLAS call, or from a library that uses cuBLAS, such as cuDNN. You won’t find these kernels documented anywhere.