L2 cache (.cg) memory load performance

Hi all, I have been doing some tests with interleaving loads through different cache levels, and I’ve been getting some results I don’t entirely understand. I am using CUDA 8.0 with VC++ 2015.

I’m running this code on a 980 Ti, which from what I’ve researched should let me explicitly use or bypass the L1 cache.

I am using two functions wrapping around PTX instructions for explicit cache usage.

__device__ int loadThroughL1Cache(int* p)
{
	int out;
	asm("ld.global.ca.s32 %0, [%1];" : "=r"(out) : "l"(p));
	return out;
}

__device__ int loadThroughL2Cache(int* p)
{
	int out;
	asm("ld.global.cg.s32 %0, [%1];" : "=r"(out) : "l"(p));
	return out;
}

From the PTX ISA documentation, the .cg qualifier caches a load only at the global (L2) level, bypassing L1, while the .ca qualifier (the default) caches at all levels, including L1.

I’ve been trying to use these to stop one low hit rate read from thrashing the L1 cache, so that another high hit rate read gets better L1 hit rates.
Here’s my full test code:

#include <cstdio>
#include <ctime>

#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// 128MB
#define count 1024*1024*128

#define WARP_SIZE 32

#define BLOCK_SIZE 32

__device__ int* ptr;

__device__ int* cudaOut;

int* workspace;

int* output;

void checkError(cudaError_t err) {
	if (err == cudaSuccess)
		return;
	const char* error = cudaGetErrorString(err);
	printf("%s\n", error);
}

__device__ int loadThroughL1Cache(int* p)
{
	int out;
	asm("ld.global.ca.s32 %0, [%1];" : "=r"(out) : "l"(p));
	return out;
}

__device__ int loadThroughL2Cache(int* p)
{
	int out;
	asm("ld.global.cg.s32 %0, [%1];" : "=r"(out) : "l"(p));
	return out;
}

// Result: the overhead
// each thread processes a 1024-byte block of ram
__global__ void interleavedCacheLoadsTest1(int* area, size_t size, int* out) 
{
	int px = (blockIdx.x * blockDim.x) + threadIdx.x;
	int val = 0;

	// 1st half has "random" accesses
	// over 128m of memory, no threads read same area
	if (px >= count)
		return;

	int offset = px * 16384;
	int offset2 = (px % 17) + (px % 3) + (px % 47) + (px % 13);

	for (int r = 0; r < 1; r++)
	{
		for (int i = 0; i < 16384; i++)
		{
			// When this runs through L1 cache, it has 0% L1 hit rate
			val += loadThroughL2Cache(&area[offset + ((i) % 1024)]);
			// When this runs through L1 cache, it has 100% L1 hit rate
			val += loadThroughL1Cache(&area[(offset2 + (i * offset2)) % 4096]);
		}
	}

	out[px] = val;
}

int main()
{
	cudaDeviceSynchronize();

	workspace = (int*)malloc(count*sizeof(int));

	for (int i = 0; i < count; i++) {
		//16 bits set
		workspace[i] = 0x000000FF;
	}
	
	checkError(cudaMalloc(&ptr, count*sizeof(int)));
	checkError(cudaMalloc(&cudaOut, count*sizeof(int)*sizeof(int)));
	checkError(cudaMemcpy(ptr, workspace, count*sizeof(int), cudaMemcpyHostToDevice));

	dim3 grid = dim3(count / ((WARP_SIZE*WARP_SIZE)* 16384) + 1, 1, 1);
	dim3 block = dim3(WARP_SIZE*WARP_SIZE, 1, 1);
	time_t time;

	cudaDeviceSynchronize();
	checkError(cudaGetLastError());
	time = -clock();
	interleavedCacheLoadsTest1<<<grid, block>>>(ptr, count, cudaOut);
	checkError(cudaDeviceSynchronize());
	time += clock();
	printf("Time: %f\n", ((float)(time) / CLOCKS_PER_SEC));
	__debugbreak();
}

I have designed line 64 (the loadThroughL2Cache read) to thrash the L1 cache. If you comment out line 66 and change line 64 to use the L1 read function, it gets a 0.1% hit rate.
I have designed line 66 (the loadThroughL1Cache read) to cache perfectly. If you comment out line 64, you will see the L1 read gets a 100% L1 cache hit rate.
HOWEVER, when these reads are interleaved as in the unmodified code, I only get a 15.8% L1 hit rate. If the .cg instruction really is avoiding the L1 cache, shouldn’t line 66 still get 100% L1 cache hits? It seems like the .cg instruction is still loading through the L1 cache…

Equally strange, I found that if you comment out line 66, line 64 will actually run substantially faster if loaded through the L1 cache, despite there being virtually no L1 cache hits… which seems to imply there is more overhead in not checking the cache?

I’m wondering if this has something to do with Maxwell having a shared texture/L1 cache? I don’t know if that would affect any of my assumptions on what does/doesn’t count as an L1 cache load?

I had to correct a few trivial mistakes to make the code compile - is this the actual code you are running?

Note that all memory sizes are 4× larger than the comments suggest because you are using ints, not chars.

Also, the code is only running 9 blocks because of the odd division by 16384 in the grid size calculation; is that intentional?
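
For reference, just plugging the defines from the posted code into the grid size calculation:

// grid.x = count / ((WARP_SIZE*WARP_SIZE) * 16384) + 1
//        = (1024*1024*128) / (1024 * 16384) + 1
//        = 134217728 / 16777216 + 1
//        = 9 blocks, each of WARP_SIZE*WARP_SIZE = 1024 threads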

What happens if, instead of loadThroughL1Cache(), you use __ldg() to load via the texture cache?

Yep, this is the actual code; it compiles fine on my machine. Yup, the comments in the code are outdated. The weird division is also intentional. I didn’t explicitly plan for 9 blocks, but I’m guessing that’s probably enough to roughly test cache performance, at least as far as cache hits and misses go.

Just tested using the texture cache:

__device__ int loadThroughTextureCache(int* p)
{
	return __ldg(p);
}

I ran each of these tests 3 times and averaged the result.

Tests for line 64 (the 0% cache hit read):
66 commented out, running 64 through L2 cache only: ~0.059ms
66 commented out, running 64 through L1 cache: ~0.037ms
66 commented out, running 64 through texture cache: ~0.055ms
This is still strange. Even with a 0.0% L1 hit rate, the load still runs faster through L1 cache… however, only a minimal gain comes from using the texture cache (which again, has a 0% hit rate…).

Tests for line 66 (the 100% cache hit read):
64 commented out, running 66 through L2 cache only: 0.009
64 commented out, running 66 through L1 cache: 0.007
64 commented out, running 66 through texture cache: 0.004
Nsight performance analysis claims there are still ~18% L1/texture cache hits when using loadThroughL2Cache, which is strange. Looking deeper in Nsight, it looks like the .cg loads are going through some cache, but the hit rate is much, much lower than if I use __ldg.

Either way, I get ~15.8% cache hits with:

// When this runs through L1 cache, it has 0% L1 hit rate
val += loadThroughL2Cache(&area[offset + ((i) % 1024)]);
// When this runs through L1 cache, it has 100% L1 hit rate
val += loadThroughL1Cache(&area[(offset2 + (i * offset2)) % 4096]);

and ~35.5% cache hits with:

// When this runs through L1 cache, it has 0% L1 hit rate
val += loadThroughL2Cache(&area[offset + ((i) % 1024)]);
// When this runs through L1 cache, it has 100% L1 hit rate
val += loadThroughTextureCache(&area[(offset2 + (i * offset2)) % 4096]);

And both give 100% cache hits with line 64 commented out.

Another thing I came across: if you change line 66 to use % 8192, the texture cache path gets more cache hits than the L1 path, which would seem to imply the texture cache is larger. That seems strange, since I believe the Maxwell documentation states the L1 and texture caches share the same memory…?
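
To be concrete, the % 8192 variant of line 66 is the same index expression with a larger modulus, i.e. something like:

val += loadThroughTextureCache(&area[(offset2 + (i * offset2)) % 8192]);
// or loadThroughL1Cache(...) for the L1 version of the same test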

I also get compile errors on Linux.

The code is not portable to Linux as-is. It probably compiles OK on Windows/VS.

These are the changes I need to compile (on Linux, CUDA 8.0):

--- orig.cu	2017-01-04 22:17:17.070933610 +0000
+++ cache.cu	2017-01-04 22:18:56.949416278 +0000
@@ -12,9 +12,9 @@
 
 #define BLOCK_SIZE 32
 
-__device__ int* ptr;
+/*__device__*/ int* ptr;
 
-__device__ int* cudaOut;
+/*__device__*/ int* cudaOut;
 
 int* workspace;
 
@@ -78,7 +78,7 @@
 
 	for (int i = 0; i < count; i++) {
 		//16 bits set
-		workspace = 0x000000FF;
+		workspace[ i ] = 0x000000FF;
 	}
 	
 	checkError(cudaMalloc(&ptr, count*sizeof(int)));
@@ -96,5 +96,5 @@
 	checkError(cudaDeviceSynchronize());
 	time += clock();
 	printf("Time: %f\n", ((float)(time) / CLOCKS_PER_SEC));
-	__debugbreak();
+	// __debugbreak();
 }

[Later:]
Turns out the culprit behind the missing array index [ i ] is the forum software: it interpreted the [i] as the start of italics. I had to trick it into leaving the index alone.

BTW, using an allocation size of count*sizeof(int)*sizeof(int) to avoid an out-of-bounds access is quite creative… Also, the CUDA includes are unnecessary in a file with a .cu extension.

I am suspicious of the sequential nature of the reads on line 64.

Since the compiler automatically unrolls the loop and groups multiple loads together, subsequent loads will request transactions that are already in flight. Whether the cache is completely thrashed in between or not, this is likely to be beneficial (I presume SMs have logic to catch reuse of a transaction in flight before it hits the L1 cache, otherwise the L1 cache wouldn’t be very efficient at all).
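
If you want to rule the compiler out of the picture here, note that a plain asm() statement is assumed to have no side effects, so the compiler is in principle free to merge or hoist identical loads. A variant of the wrapper with asm volatile and a "memory" clobber (just a sketch; the function name is mine) prevents that:

__device__ int loadThroughL2CacheVolatile(int* p)
{
	int out;
	// volatile: the compiler may not delete, merge, or move the asm statement
	// "memory" clobber: the load is not reordered across other memory accesses
	asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(out) : "l"(p) : "memory");
	return out;
}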

Furthermore, going through the L1 cache increases the transaction size from 32 bytes to 128 bytes, which means that more data will be “read ahead” by reusing transactions already in flight.
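
To put rough numbers on that for the line 64 read (the arithmetic just follows from the posted index expression; the in-flight reuse part is still only my suspicion):

// line 64, per thread: area[offset + (i % 1024)] with 4-byte ints and i advancing by 1,
// so each thread walks its own 4 KB window sequentially.
// A 128-byte fill covers the next 32 iterations of that thread;
// a 32-byte fill covers only the next 8.
// With the loop unrolled, far more of the back-to-back loads can therefore reuse a
// transaction that is already in flight when going through L1.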

Unfortunately I don’t have compute capability 5.2 hardware available to test my suspicions myself. Do timings change if you introduce a “#pragma unroll 1” before the inner loop on line 61? Do timings change if you replace line 64 with

val += loadThroughL2Cache(&area[offset + ((i*123) % 1024)]);

?
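
For clarity, the first experiment just means placing the pragma like this (a sketch only; untested, since I don't have the hardware):

	for (int r = 0; r < 1; r++)
	{
#pragma unroll 1 // keep the inner loop rolled so the loads are not batched
		for (int i = 0; i < 16384; i++)
		{
			val += loadThroughL2Cache(&area[offset + ((i) % 1024)]);
			val += loadThroughL1Cache(&area[(offset2 + (i * offset2)) % 4096]);
		}
	}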

I’m doing a sequential read on line 64 so that most of the transfers get stored in L2 cache.
Changing it to:

val += loadThroughL2Cache(&area[offset + ((i*123) % 1024)]);

With line 66 using texture cache:
The L1/texture hit rates drop to 30.7%, so I guess this is further evidence that the .cg load is still going through the cache.
The L2 cache hit rate drops from 92.5% without the multiplication, to 27% (which makes sense since it’s no longer sequential).
With line 66 using the L1 cache:
L1 cache hit rate stays at 15.8%, pretty much identical to not multiplying i.
The L2 cache hit rate drops from 92.5% to 40.2%.

Sure enough, for both of those there are massively more device memory reads due to L2 cache misses from no longer reading sequentially.

The compiler definitely was unrolling the loop, and putting that limit on the loop unrolling slows things down, particularly with the sequential access. However, I noticed the same load instructions are used in both the PTX and SASS regardless of whether the loop is unrolled or not, with both being 32-bit loads. I might try modifying the code to do less linear reads in a smaller space that fits in L2 cache, and see what happens.
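
Something like this is what I have in mind for that follow-up test (just a sketch; I'm assuming the 980 Ti's L2 is 3 MB, so a 2 MB window should fit):

// hypothetical replacement for line 64: non-sequential reads confined to a 2 MB window,
// which should fit in L2 (assuming the 980 Ti's L2 is 3 MB)
val += loadThroughL2Cache(&area[(offset + i * 123) % (512 * 1024)]);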