Kernel seems to be slower when launched from the Python runtime instead of the C++ runtime on Orin

I wrote a GEMM kernel, and when I tested it using a C++ unit test, the execution time was 9.8 ms.

However, after integrating it into PyTorch and using pytest for unit testing and profiling, the execution time increased to 12 ms. What could be the possible reasons for this?
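One thing worth ruling out is that the pytest measurement includes host-side Python overhead rather than pure kernel time. A minimal sketch of timing with CUDA events (which measure device time only), assuming the custom GEMM is exposed to Python as a callable; `torch.matmul` stands in here for the actual kernel binding, and the CPU fallback exists only so the sketch runs without a GPU:

```python
import time
import torch

def time_gemm(fn, a, b, iters=100, warmup=10):
    """Return the average execution time of fn(a, b) in milliseconds."""
    if torch.cuda.is_available():
        # CUDA events record timestamps on the GPU timeline, so Python
        # launch overhead is excluded from the measured interval.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(warmup):
            fn(a, b)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn(a, b)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters
    # CPU fallback: wall-clock timing so the sketch still runs without CUDA.
    for _ in range(warmup):
        fn(a, b)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(a, b)
    return (time.perf_counter() - t0) * 1e3 / iters

a = torch.randn(256, 256)
b = torch.randn(256, 256)
ms = time_gemm(torch.matmul, a, b, iters=10)  # replace torch.matmul with the custom GEMM
print(f"avg: {ms:.3f} ms")
```

If the event-based number from Python matches the C++ measurement, the gap is host-side overhead rather than the kernel itself.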

Furthermore, I used ncu to profile both cases and found that the memory bandwidth had decreased substantially in the PyTorch case.

More precisely:

The machines used for both tests were identical, running in the same Docker environment, with MAXN mode enabled and the clocks locked via jetson_clocks:

sudo nvpmodel -m 0
sudo jetson_clocks

And this is my Orin info from the jtop output:

Hi,
It looks like you are using JetPack 6.0 GA. We would suggest upgrading to the latest JetPack 6.1 and trying again.

I didn’t find any related updates in the JetPack 6.1 release notes: JetPack 6.1 Release Notes — JetPack 6.1 documentation

I have tried it with JetPack 6.1; it didn’t help.

Moreover, when I used nsys to profile the whole program that calls this kernel through PyTorch, I found that the GPC frequency and GPU bandwidth are unstable and vary over time.
Here is a screenshot.

As you can see, the GPC clock fluctuates between 600 MHz and 1300 MHz while this kernel is executing, whereas when the kernel is not running, the GPC clock remains stable at around 1300 MHz.