Why is the RTX 4060 slower than the GTX 1660 Ti for basic data computation?

I want to perform envelope operations on the data, as shown in the figure below. I ran this code on an RTX 4060 and a GTX 1660 Ti, using the same CUDA version, the same working environment, and the same grid and block dimensions.
On the GTX 1660 Ti it took 0.0698 ms, while on the RTX 4060 it took 0.122 ms. I also tried other grid and block configurations such as (32,1), (128,1), (128,4)… but it still ran slower than on the GTX 1660 Ti.

Please don’t post pictures of code on these forums. Instead, post the code as text, properly formatted using the code button (</>).

I got it! The two GPUs were in two different PCs, so besides the GPU there must be other hardware factors involved. When I tested both GPUs in the same PC, the results were reasonable.
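
For anyone who runs into the same thing: with kernels this short (around 0.1 ms), the measured number can easily be influenced by host-side effects such as launch overhead and one-time initialization, not only by the GPU. Below is a minimal sketch of one way to time a kernel with CUDA events, using a warm-up launch and an average over many launches. The kernel `scale_kernel` and the helper `average_kernel_ms` are hypothetical stand-ins, since the original envelope code was only posted as a picture; the problem size and launch configuration are assumptions too.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (NOT the original envelope kernel).
// The update x -> 0.5*x + 1 stays bounded no matter how often it runs.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Returns the average time per launch in milliseconds as seen on the GPU
// timeline, or -1.0f if a CUDA call fails (for example, no usable GPU).
float average_kernel_ms(int n, int threads_per_block, int iterations) {
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    int blocks = (n + threads_per_block - 1) / threads_per_block;

    // Warm-up launch: absorbs one-time costs such as context setup
    // and module loading so they do not end up in the measurement.
    scale_kernel<<<blocks, threads_per_block>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around many launches and average the result.
    cudaEventRecord(start);
    for (int it = 0; it < iterations; ++it) {
        scale_kernel<<<blocks, threads_per_block>>>(d_data, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all launches have finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ok ? total_ms / iterations : -1.0f;
}

int main() {
    // Assumed problem size and block size, chosen only for illustration.
    float ms = average_kernel_ms(1 << 20, 128, 1000);
    if (ms < 0.0f) {
        std::printf("CUDA error while timing\n");
        return 1;
    }
    std::printf("average kernel time: %.4f ms\n", ms);
    return 0;
}
```

Even with events, very short kernels launched back to back can still be limited by how fast the host can issue launches, so comparing two GPUs is most meaningful when both are tested in the same PC, as described above.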

Thank you for your reminder. Sorry, this is my first time posting.