How can I make Quadro K420 skip L1 and L2 caches when loading a variable?

tugrul_192bit · April 7, 2018, 11:08pm

I think one of my algorithms disturbs the cache space of some variables that are used by multiple cores, because it loads too much data (each item only once, by a single thread and no others). How can I stop it? From PTX, or with some compiler option for certain kernel parameters?

Even though it has only a 256 kB L2 cache, some ordering (z?) support should make it good enough. Unfortunately, some other one-time-use variables are loaded through L1 (and L2?); they add up to something like 800 MB and are probably making the caches not very helpful. I know I need a better algorithm, but that will take time, so I’m just wondering about this L1/L2 bypassing ability.

Robert_Crovella · April 7, 2018, 11:27pm

The L2 activity is unavoidable. All data retrieved from device memory flows through the L2. You cannot bypass it.

L1 should not actually be enabled for global loads on a Kepler Quadro K420.

The L1 can be bypassed with a non-caching load. There are plenty of descriptions of this available on the web if you care to search for them.
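As a minimal sketch of what a non-caching load can look like, the example below wraps the PTX `ld.global.cg` instruction (cache at L2 only, skip L1) in inline assembly. The helper name `load_cg`, the kernel `scale_stream`, and the sizes are hypothetical, the code assumes a 64-bit build, and it uses the runtime API for brevity rather than NVRTC and the driver API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: load one float with the .cg cache operator,
// which caches the line in L2 only and skips L1.
// Assumes ptr refers to global memory and a 64-bit address ("l" constraint).
__device__ __forceinline__ float load_cg(const float* ptr) {
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

// Each thread reads exactly one element, once (the read-once pattern
// described in the question).
__global__ void scale_stream(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * load_cg(in + i);
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 2.0f);

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale_stream<<<(n + 255) / 256, 256>>>(in, out, n, 3.0f);
    cudaMemcpy(host.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Simple self-check; error handling of the CUDA calls is omitted.
    printf("%s\n", host[0] == 6.0f ? "ok" : "mismatch");

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If L1 is already not used for global loads on the device, this should make no measurable difference there; the data still passes through L2 in either case. PTX also defines a `.cs` (cache streaming) operator intended for data that is likely to be accessed once, but whether it helps on a given GPU would need to be measured.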

Kepler Tuning Guide :: CUDA Toolkit Documentation

On cc 3.5 and higher devices (yours is cc 3.0), the usual advice for read-once data is to load it through the read-only cache. This still impacts L2, however.

tugrul_192bit · April 7, 2018, 11:33pm

For a read-only cache, I should use textures and constant memory, right? Just putting const (and restrict) before a parameter didn’t make a difference on my setup.

I’m using NVRTC and the driver API, so maybe I could embed some data directly in the kernel source strings? Maybe it would then come through the instruction cache? I mean replicating something as if the compiler were doing it for me, as an array of operations: for example, streaming 800 MB from inside the kernel, without any parameters. (But I’m staying away from this for now.)

Robert_Crovella · April 8, 2018, 1:01am

Correct. For devices of compute capability less than 3.5, your options for read-only traffic optimization are texture and constant memory.
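A rough sketch of the texture and constant options follows, assuming the runtime API and a device that supports texture objects (cc 3.0 and up). The large read-once buffer is fetched through a texture object bound to linear memory, and a small value shared by all threads lives in `__constant__` memory. The names `kScale` and `scale_via_texture` and the buffer size are hypothetical. Note that textures bound to linear memory have a maximum element count that depends on the compute capability, so a very large buffer may need to be split.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical small value read by every thread; served by the constant cache.
__constant__ float kScale;

// Read-once data is fetched through the texture path instead of a plain load.
__global__ void scale_via_texture(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = kScale * tex1Dfetch<float>(tex, i);
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 2.0f);

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const float scale = 3.0f;
    cudaMemcpyToSymbol(kScale, &scale, sizeof(scale));

    // Describe the existing linear buffer as a texture resource.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;  // return raw floats, no conversion

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    scale_via_texture<<<(n + 255) / 256, 256>>>(tex, out, n);
    cudaMemcpy(host.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Simple self-check; error handling of the CUDA calls is omitted.
    printf("%s\n", host[0] == 6.0f ? "ok" : "mismatch");

    cudaDestroyTextureObject(tex);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The texture fetches still go through L2, as noted earlier in the thread, so this changes which first-level cache is used and does not bypass L2.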

On devices of cc 3.5 and higher, decorating a global pointer with const __restrict__ should be a strong hint to the compiler to use the “read-only” cache for that load traffic.
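For cc 3.5 and higher, the pointer decoration can be sketched as below. The kernel name `add_readonly` and the sizes are hypothetical. The `__ldg()` intrinsic is shown alongside the qualifiers as an explicit way to request a read-only-cache load; it is guarded so the same source still compiles for older targets, where it falls back to an ordinary load.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// const + __restrict__ tells the compiler that a and b are read-only and
// not aliased by out, which permits read-only-cache loads on cc 3.5+.
__global__ void add_readonly(const float* __restrict__ a,
                             const float* __restrict__ b,
                             float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if __CUDA_ARCH__ >= 350
        // Explicit read-only-cache load for a; b is left to the compiler.
        out[i] = __ldg(&a[i]) + b[i];
#else
        // Older targets (such as cc 3.0): ordinary global loads.
        out[i] = a[i] + b[i];
#endif
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hout(n, 0.0f);

    float *a = nullptr, *b = nullptr, *out = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(a, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    add_readonly<<<(n + 255) / 256, 256>>>(a, b, out, n);
    cudaMemcpy(hout.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Simple self-check; error handling of the CUDA calls is omitted.
    printf("%s\n", hout[0] == 3.0f ? "ok" : "mismatch");

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```

On a cc 3.0 device the qualifiers are accepted but have no read-only cache to target, which is consistent with const and restrict making no difference in the setup described above.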
