I think one of my algorithms is polluting the cache space of some variables that are used by multiple cores, because it loads too much data that is read only once (each element by a single thread, never by other threads). How can I stop this? Can it be done from PTX, or with a compiler option, for specific kernel parameters?
Even though the GPU has only 256 kB of L2 cache, some access ordering (Z-order?) should make that good enough. Unfortunately, other one-time-use variables are also loaded through L1 (and L2?), and they total something like 800 MB, which probably makes the caches not very helpful. I know I need a better algorithm, but that will take time, so for now I am just wondering whether L1/L2 bypassing like this is possible.
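
To make the question concrete, here is a minimal sketch of the kind of thing I am after. It assumes the cache-hint load/store intrinsics `__ldcs` and `__stcs` from the CUDA programming guide, which as far as I understand need compute capability 5.0 or higher and emit `ld.global.cs` / `st.global.cs` in PTX. These are only "likely accessed once, evict first" hints, not a guaranteed bypass. The kernel name `scale`, the buffers `stream` and `shared_table`, and all sizes are hypothetical stand-ins for my real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: `stream` stands in for the large one-time-use input,
// `shared_table` for the small data that many threads reuse.
__global__ void scale(const float* stream, const float* shared_table,
                      float* out, int n, int table_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Streaming load hint: the line is expected to be read once,
        // so the cache may evict it first.
        float x = __ldcs(&stream[i]);
        // Ordinary load for the reused data (default caching policy).
        float w = shared_table[i % table_size];
        // Streaming store hint for the one-time-written output.
        __stcs(&out[i], x * w);
    }
}

int main()
{
    const int n = 1 << 20;        // illustrative size, not my real 800 MB
    const int table_size = 256;   // illustrative size of the reused data
    float *stream, *table, *out;
    cudaMallocManaged(&stream, n * sizeof(float));
    cudaMallocManaged(&table, table_size * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) stream[i] = 1.0f;
    for (int j = 0; j < table_size; ++j) table[j] = 2.0f;

    scale<<<(n + 255) / 256, 256>>>(stream, table, out, n, table_size);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should be 1.0f * 2.0f.
    int bad = 0;
    for (int i = 0; i < n; ++i) if (out[i] != 2.0f) ++bad;
    printf("mismatches: %d\n", bad);

    cudaFree(stream); cudaFree(table); cudaFree(out);
    return bad != 0;
}
```

I would build this with something like `nvcc -arch=sm_50 example.cu`. The compiler-option route I have seen mentioned is `-Xptxas -dlcm=cg`, which as I understand it makes global loads cache in L2 only and skip L1. That applies to the whole compilation unit rather than to individual kernel parameters, and it does not bypass L2, so it is not quite what I am asking for.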