Read-only data cache: only for Tesla, or also for GTX 680?

According to the Kepler whitepaper, there seems to be a new cache available, the read-only data cache of 48 KB. Is this available for the GTX 680 as well, or only for Tesla cards?

The read-only cache is just the texture cache. The GK110 has a special bindless texture feature, which means that just by labelling a pointer as const __restrict__ the compiler will automatically read the data through the texture cache. This feature does not exist on GK104 / GK107, but you can of course just bind to a texture and get the same benefit, albeit with slightly more work. The texture cache is 48 KiB per SM regardless of whether it is GK104, GK107 or GK110.

Please refer to the following presentation from GTC 2012 (in particular, slide 32):

In order to transparently read data through the texture cache, the data must be read-only for the entire duration of the kernel (since the texture cache is not coherent with the underlying storage). The compiler must be able to establish that this is definitely the case before generating the new GK110 instruction for access through the texture cache. Appropriate use of “const” and “restrict” helps the compiler with this task. A const pointer indicates that the data is read-only when accessed through this pointer; the use of restrict is an assertion by the programmer that the data is not accessed (and thus potentially written) through any other pointer.

Note that the existing modifiers const and restrict can be applied to pointers passed to any function, but that the compiler must establish the read-only nature of a piece of data for the entire duration of the kernel, i.e. across all functions called by that kernel, before it can read data through the texture cache automatically. The mere presence of const restrict pointers thus does not guarantee reads through the texture path, but it helps facilitate such reads.
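As a minimal illustration (a made-up kernel, not taken from the whitepaper or the presentation), the qualifiers go on the kernel parameters like this; on GK110 the loads from in then become candidates for the read-only texture path:

__global__ void scale(float * __restrict__ out, const float * __restrict__ in, float alpha, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = alpha * in[i];   // the compiler may turn this load into an LDG through the texture cache
}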

The use of const and restrict where applicable helps enable compiler optimizations on pre-Kepler GPUs as well (see the Programming Guide and the Best Practices Guide), as they provide the maximum amount of usage information to the compiler. Just the other week I was able to speed up a small application by 16% on a Fermi platform simply by adding const and restrict to kernel pointer arguments. In some cases one may observe a slowdown from this technique, when the additional freedom given to the compiler leads to more common subexpressions being carried, which in turn increases register pressure. Such scenarios are fairly rare in my experience.

Is the texture cache really 48 KB? The CUDA programming guide says “Cache working set per multiprocessor for texture memory: Device dependent, between 6 KB and 8 KB” in Table 10.

And the GK110 whitepaper says 48 KB. I’ve always been skeptical of the programming guide’s number for the tex cache, so I think the whitepaper number is correct.

Yes, it is 48 KiB on all Kepler GPUs (GK104 / GK107 / GK110). This somewhat lessens the impact of shared memory not having grown. With care, you can devise kernels that use the texture cache, the L1 cache and the shared memory all at once. In total, each SM has 368 KiB of local storage (256 KiB of registers + 64 KiB of shared memory/L1 + 48 KiB of texture cache), which is a reasonable amount of resources.

I tend to write code that allows these decisions to be made at compile time, e.g., with templates or the C preprocessor, and then tune performance accordingly. Often what is best on one architecture isn’t what is best on another.
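For example, something along these lines (just a sketch with made-up names; for the texture path the reference would be bound to the input array on the host with cudaBindTexture):

texture<float, 1, cudaReadModeElementType> in_tex;

template <bool USE_TEX>
__device__ float fetch(const float *in, int i)
{
	// compile-time choice between the texture path and a plain global load
	return USE_TEX ? tex1Dfetch(in_tex, i) : in[i];
}

template <bool USE_TEX>
__global__ void scaleKernel(float * __restrict__ out, const float * __restrict__ in, float alpha, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = alpha * fetch<USE_TEX>(in, i);
}

Then scaleKernel<true> and scaleKernel<false> can be benchmarked against each other and the faster instantiation picked per GPU.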

I have a K20, and am trying to implement texture cache reads. I have contrived a simple example, but don’t know the syntax to make const restrict work. In this example, I set up a structure and a global device array. I then have a parent kernel put values into my global array. I want the baby kernel to read the global array via the read-only texture cache, or, alternatively, through the __ldg() intrinsic.

Questions:

  1. How do I modify babyKernelConstantRestrict to implement const __restrict__ texture reads?
  2. How do I modify the __ldg function so it will work in babyKernelLDG? Now suppose myStruct were larger than 8 bytes, say an odd size such as 47 bytes. Could __ldg be modified for this case?
// static __device__ __inline__ double2 __ldg(const double2 *ptr) { double2 ret; asm volatile ("ld.global.nc.v2.f64 {%0,%1}, [%2];"  : "=d"(ret.x), "=d"(ret.y) : __LDG_PTR (ptr)); return ret; }
#include <cstdio>	// device-side printf
#include <cstdint>	// uint32_t / uint8_t

struct __align__(8) myStruct {
	uint32_t varA;
	uint8_t varB;
	uint8_t varC;
};

__device__ myStruct structArray[100];

__global__ void babyKernelConstantRestrict()
{
	myStruct tVar = structArray[threadIdx.x];
	printf("threadIdx.x %i varA %i varB %i varC
",tVar.varA, tVar.varB, tVar.varC);
}

__global__ void babyKernelLDG()
{
	myStruct tVar = __ldg(&structArray[threadIdx.x]);
	printf("threadIdx.x %i varA %i varB %i varC\n", tVar.varA, tVar.varB, tVar.varC);
}

__global__ void parentKernel()
{
	for (uint32_t i = 0; i < 100; i++)
	{
		myStruct tmp;
		tmp.varA = i;
		tmp.varB = i;
		tmp.varC = i;
		structArray[i] = tmp;
	}
	babyKernelConstantRestrict<<<1,100>>>();
	babyKernelLDG<<<1,100>>>();
}

int main()
{
	parentKernel<<<1,1>>>();
	cudaDeviceSynchronize();	// wait for the child kernels and flush device printf output
	return 0;
}

Hi DrAnderson,

Can you share your code that uses the texture cache? I can only detect 12 KB of texture / read-only cache, and I don’t know what’s wrong with my code.

Thanks

Sure, my code is open source: http://codeblue.umich.edu/hoomd-blue/

But the tex cache usage is nothing special. Just cudaBindTexture and tex1Dfetch. I tried __ldg on the K20 with no noticeable performance difference. Since I have an established code base and need to support G200 and GF100, I just left everything as tex1Dfetch.
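For reference, the pattern is just the usual one (a sketch with placeholder names, not lifted from HOOMD-blue):

texture<float, 1, cudaReadModeElementType> my_tex;

__global__ void copyViaTex(float *out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		out[i] = tex1Dfetch(my_tex, i);	// read served through the texture cache
}

void launchCopy(const float *d_in, float *d_out, int n)
{
	cudaBindTexture(0, my_tex, d_in, n * sizeof(float));
	copyViaTex<<<(n + 255) / 256, 256>>>(d_out, n);
	cudaUnbindTexture(my_tex);
}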

I haven’t done any microbenchmarking to try to determine the tex cache size. But my kernels that tax the tex cache heavily do run 2x faster on a GTX 680 than on a GTX 580.

48 KB seems like a strange number; is it perhaps actually 64 KB, with 16 KB used for something else?