However, after integrating it into PyTorch and using pytest for unit testing and profiling, the execution time increased to 12 ms. What could be the possible reasons for this?
Moreover, when I used nsys to profile the whole program, which calls this kernel through PyTorch, I found that the GPC frequency and GPU bandwidth are unstable and vary over time.
Here is a screenshot.
As you can see, the GPC clock fluctuates between 600 and 1300 while this kernel is executing, whereas it remains stable at around 1300 when the kernel is not running.
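
To rule out measurement artifacts, the 12 ms figure could be cross-checked outside pytest with a small timing helper like the sketch below. It is only an illustration, not the harness actually used here: `time_call` and its parameters are hypothetical names, and it assumes the first few calls carry one-time costs (lazy initialization, clock ramp-up) that should be excluded, and that GPU work is asynchronous, so a synchronization call is needed before the timer is stopped.

```python
import time
from statistics import median


def time_call(fn, sync=lambda: None, warmup=3, iters=10):
    """Return the median wall-clock time of fn() in milliseconds.

    fn   -- the operation to time (hypothetical: a wrapper around the kernel call)
    sync -- called after fn() to wait for pending asynchronous work;
            the default is a no-op so the helper also runs without a GPU
    """
    # Warm-up runs absorb one-time costs so they do not skew the samples.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for asynchronous work before stopping the timer
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)
```

With PyTorch, this could be called as `time_call(lambda: my_op(x), sync=torch.cuda.synchronize)`, where `my_op` and `x` stand in for the integrated kernel and its input. If the median stays near 12 ms even after warm-up, the slowdown is less likely to be a first-call effect.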