L1 cache hit rate too low

I am trying to prefetch global memory data into the L1 cache using the PTX prefetch.global.L1 instruction, but the Global Load Hit Rate reported by Nsight Compute shows no improvement; it stays at 0.07%. Meanwhile, the L2 cache hit rate has risen from 1% to 37%. What could explain this?


I placed a __syncthreads() call between the prefetch and the load, intending to ensure that the prefetch has completed before the load is issued.
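For reference, here is a minimal sketch of the pattern described above: prefetch, barrier, then load. It is not my actual kernel. The names prefetch_l1, prefetch_then_load, and run_prefetch_demo are hypothetical, as are the doubling operation and the launch configuration. The sketch assumes CUDA 11 or later for __cvta_generic_to_global and a device that supports the PTX prefetch instruction.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: issue a PTX prefetch hint for the cache line that
// contains `ptr`. The generic pointer is converted to a global-space address
// first, since prefetch.global expects an address in the global state space.
__device__ __forceinline__ void prefetch_l1(const void* ptr) {
    size_t gaddr = __cvta_generic_to_global(ptr);
    asm volatile("prefetch.global.L1 [%0];" :: "l"(gaddr));
}

// Hypothetical kernel: each thread prefetches its element, the block
// synchronizes, and then each thread performs an ordinary global load.
__global__ void prefetch_then_load(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        prefetch_l1(in + i);
    }

    // Block-wide barrier, reached by every thread in the block. Assumption:
    // this is only a barrier between threads; the sketch does not rely on it
    // waiting for the prefetch itself to finish.
    __syncthreads();

    if (i < n) {
        out[i] = in[i] * 2.0f;  // the load whose hit rate is being profiled
    }
}

// Hypothetical host driver: runs the kernel on n elements and returns the
// number of mismatching outputs (0 on success, -1 on a CUDA error).
int run_prefetch_demo(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    prefetch_then_load<<<blocks, threads>>>(d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();

    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling run_prefetch_demo from a small main and profiling that binary with Nsight Compute should give a reduced case for comparing the L1 and L2 hit rates with and without the prefetch_l1 call.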