Instruction cache and instruction fetch stalls

aroop · June 23, 2019, 5:16am

Hi,

I’m looking for some information about how the instruction cache works. My GPU is GTX 1070ti (Pascal). Code is compiled with CUDA 10.1.
Here’s my observation:

I’ve created a kernel that represents a loop with a long body. It doesn’t perform any global reads.

__device__ __forceinline__
uint2 ROTATE(const uint2 a, const uint32_t offset)
{
	uint2 result;
	asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.x) : "r"(a.y), "r"(a.x), "r"(offset));
	asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.y) : "r"(a.x), "r"(a.y), "r"(offset));
	return result;
}

__device__ __forceinline__
uint2 ROTATE24(const uint2 a)
{
	uint2 result;
	result.x = __byte_perm(a.x, a.y, 0x6543);
	result.y = __byte_perm(a.y, a.x, 0x6543);
	return result;
}

__device__ __forceinline__
uint2 ROTATE16(const uint2 a)
{
	uint2 result;
	result.x = __byte_perm(a.x, a.y, 0x5432);
	result.y = __byte_perm(a.y, a.x, 0x5432);
	return result;
}

__global__
__launch_bounds__(256, 2)
void my_kernel(uint32_t* out)
{
	uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;

	uint2 data[4];
	data[0] = make_uint2(thread, thread);
	data[1] = make_uint2(thread + 5, thread + 5);
	data[2] = make_uint2(thread + 15, thread + 15);
	data[3] = make_uint2(thread + 20, thread + 20);

	#pragma unroll 1
	for (int t = 0; t < 64; t++)
	{
		#pragma unroll
		for (int i = 0; i < 400; i++)
		{
			data[0] = ROTATE(data[0], 63);
			data[1] = ROTATE24(data[1]);
			data[2] = ROTATE16(data[2]);
			data[3] ^= data[2];
		}
	}

	if (data[0].x ^ data[1].x ^ data[2].x ^ data[3].x == 0 &&
		data[0].y ^ data[1].y ^ data[2].y ^ data[3].y == 0)
		out[0] = thread;
}

When I run it under nvprof, it shows 13% instruction fetch stalls. However, when I compile the module with “-Xptxas -dlcm=ca” flag the instruction fetch stalls drop to 5.7% and I see a performance improvement of ~4%.
I inspected cuobjdump output for both versions and there’s no difference in SASS code at all, but the cubin produced with dlcm flag contains an additional attribute:

Attribute:	EIATTR_EXPLICIT_CACHING

SASS code size is ~31KB, so assuming L1 instruction cache size for Pascal is 8KB, L1.5 is 32KB (I read it in one of the “Dissecting GPU via microbenchmarking” papers), my theory is when I launch the kernel without EIATTR_EXPLICIT_CACHING attribute, the code gets cached in L1.5 cache and won’t go into L1, whereas EIATTR_EXPLICIT_CACHING forces L1 caching for part of the code that fits into it. If someone could shed some light on this, I would be very grateful.

Another thing I noticed is sometimes increasing the length of a loop’s body actually reduces the amount of instruction fetch stalls and provides a performance gain, which contradicts a general rule of “longer code → more instruction fetch stalls”. Do I need to “align” the loop body to some magic number of instructions to bring the stalls to the minimum? I know the information I’m looking for isn’t published, but perhaps there are some other observations people did?
Thanks.

Greg · June 26, 2019, 1:56am

When enabling dlcm=ca on Pascal GPUs thread blocks are distributed across a single SM partition (2 warp schedulers) instead of all 4 warps schedulers. This will impact occupancy and scheduling. Caching of L1 data can result in lower memory latency (long scoreboard stalls). The thread block scheduling change is done so that all warps in the thread block communicate with the same unified L1/TEX cache. I suspect the work distribution is resulting in difference in warp scheduling and L1 latency.

Instruction and constant caches access L2 through the GPC L1.5 cache but all local/global/texture/surface access go through the unified L1/TEX cache to the L2 cache. The attribute you mention does not change the policy of the instruction cache, constant cache, or L1.5 cache.

The instruction cache supports automatic prefetch. The compiler will attempt to align loop bodies to cache line boundaries when its heuristic determines this will provide a performance benefit. It is not possible via CUDA C to attempt to align the loop body. Unrolling a loop will change the distribution of stall reasons as the compiler may batch memory operations when unrolling. Single body loops tend to have higher short/long scoreboard stalls and allocation stalls (waiting for branch to resolve).

system · September 14, 2023, 8:45am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
"Instruction Fetch" in Nsight Performance Analysis CUDA Programming and Performance	8	2502	January 7, 2016
"no instruction" stalls every 256 bytes of the binary code CUDA Programming and Performance	7	1561	February 14, 2019
code instruction cache? CUDA Programming and Performance	12	4581	July 31, 2015
How does reducing unrolling or branching code actually reduce instruction fetch? CUDA Programming and Performance	16	2729	December 4, 2016
Very weird behaviour CUDA Programming and Performance	20	1529	March 25, 2011
Instruction cache misses + Local stack and spills CUDA Programming and Performance cuda , kernel	9	67	January 11, 2025
Things related to stall reasons... or not so related CUDA Programming and Performance	6	2003	April 14, 2017
Unexplained high instruction replay CUDA Programming and Performance	6	2805	February 17, 2015
Does %clock measure actual GPU cycles, or what? CUDA Programming and Performance	5	1590	July 9, 2019
L2 cache (.cg) memory load performance CUDA Programming and Performance	6	1689	January 5, 2017

Instruction cache and instruction fetch stalls

Related topics