Software prefetch at kernel level

Continuing from this thread: "Tuning a kernel with LDG(ON/OFF,array) and prefetching".
(1) It has been 4 years, with a lot of updates since then. Has anyone successfully improved a kernel's performance with software prefetch?
(2) How does hardware prefetch happen? Does the hardware detect memory loads automatically and issue the prefetches in parallel with the computation?