Kernel performance difference with different pointers

I don’t know if you can answer this with such limited context, but here goes.
I’m writing a convolution kernel in CUDA C++ and calling it from Python through a DLL (the usual setup).
The Python script has a block of code…

In this kerngrad function there is a pointer, temp_c, to device memory, and the kernel writes data to that address. The problem is that I get very low performance when I pass this pointer (which is the one I actually need there), but I get the correct speed if I pass kgrad_c instead (I know that’s the correct speed because I unit tested it). Both pointers are used immediately in the next kernel, so this shouldn’t be a dependency issue.

The time goes from 63 ms to 2.78 s. Both pointers are initialized from the same cudaMalloc call, so unified memory shouldn’t be an issue there.

(self.bsize = 67502532 and data_buffer_size = 67518464, so both pointers can accommodate self.bsize)
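A quick sanity check of those two numbers, as a minimal Python sketch. The helper `fits` and the `bytes_per_item` parameter are hypothetical names for illustration only. The second check assumes bsize might be an element count of 4-byte floats while the buffer size is in bytes. I have not confirmed the units, so that case is only a possibility to rule out.

```python
# Hypothetical helper: the name and the element-size parameter are illustrative.
def fits(buffer_bytes: int, count: int, bytes_per_item: int = 1) -> bool:
    """Return True if `count` items of `bytes_per_item` bytes fit in `buffer_bytes`."""
    return count * bytes_per_item <= buffer_bytes

bsize = 67502532             # self.bsize from the post
data_buffer_size = 67518464  # allocation size from the post

# If both values are in the same unit, the buffer is large enough.
same_unit_ok = fits(data_buffer_size, bsize)
slack = data_buffer_size - bsize

# Assumption (not stated in the post): bsize counts 4-byte floats while
# data_buffer_size is in bytes. In that case the buffer would be too small.
float_count_ok = fits(data_buffer_size, bsize, bytes_per_item=4)

print(same_unit_ok, slack, float_count_ok)
```

If the two values share a unit, the buffer has room to spare. If they do not, the kernel could be writing past the end of the allocation, which is worth ruling out before looking elsewhere.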

nvvp (the NVIDIA Visual Profiler) says the kernel is accessing system memory even though I’m not using cudaMallocManaged, and I do have free memory on the device anyway. Why would it go to system memory?

What could be the issue here?
Am I missing a concept?
TY :D