Instruction cache and instruction fetch stalls

Hi,

I’m looking for some information about how the instruction cache works. My GPU is GTX 1070ti (Pascal). Code is compiled with CUDA 10.1.
Here’s my observation:

I’ve created a kernel that represents a loop with a long body. It doesn’t perform any global reads.

__device__ __forceinline__
uint2 ROTATE(const uint2 a, const uint32_t offset)
{
	uint2 result;
	asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.x) : "r"(a.y), "r"(a.x), "r"(offset));
	asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.y) : "r"(a.x), "r"(a.y), "r"(offset));
	return result;
}

__device__ __forceinline__
uint2 ROTATE24(const uint2 a)
{
	uint2 result;
	result.x = __byte_perm(a.x, a.y, 0x6543);
	result.y = __byte_perm(a.y, a.x, 0x6543);
	return result;
}

__device__ __forceinline__
uint2 ROTATE16(const uint2 a)
{
	uint2 result;
	result.x = __byte_perm(a.x, a.y, 0x5432);
	result.y = __byte_perm(a.y, a.x, 0x5432);
	return result;
}

__global__
__launch_bounds__(256, 2)
void my_kernel(uint32_t* out)
{
	uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;

	uint2 data[4];
	data[0] = make_uint2(thread, thread);
	data[1] = make_uint2(thread + 5, thread + 5);
	data[2] = make_uint2(thread + 15, thread + 15);
	data[3] = make_uint2(thread + 20, thread + 20);

	#pragma unroll 1
	for (int t = 0; t < 64; t++)
	{
		#pragma unroll
		for (int i = 0; i < 400; i++)
		{
			data[0] = ROTATE(data[0], 63);
			data[1] = ROTATE24(data[1]);
			data[2] = ROTATE16(data[2]);
			data[3] ^= data[2];
		}
	}

	if (data[0].x ^ data[1].x ^ data[2].x ^ data[3].x == 0 &&
		data[0].y ^ data[1].y ^ data[2].y ^ data[3].y == 0)
		out[0] = thread;
}

When I run it under nvprof, it shows 13% instruction fetch stalls. However, when I compile the module with “-Xptxas -dlcm=ca” flag the instruction fetch stalls drop to 5.7% and I see a performance improvement of ~4%.
I inspected cuobjdump output for both versions and there’s no difference in SASS code at all, but the cubin produced with dlcm flag contains an additional attribute:

Attribute:	EIATTR_EXPLICIT_CACHING

SASS code size is ~31KB, so assuming L1 instruction cache size for Pascal is 8KB, L1.5 is 32KB (I read it in one of the “Dissecting GPU via microbenchmarking” papers), my theory is when I launch the kernel without EIATTR_EXPLICIT_CACHING attribute, the code gets cached in L1.5 cache and won’t go into L1, whereas EIATTR_EXPLICIT_CACHING forces L1 caching for part of the code that fits into it. If someone could shed some light on this, I would be very grateful.

Another thing I noticed is sometimes increasing the length of a loop’s body actually reduces the amount of instruction fetch stalls and provides a performance gain, which contradicts a general rule of “longer code -> more instruction fetch stalls”. Do I need to “align” the loop body to some magic number of instructions to bring the stalls to the minimum? I know the information I’m looking for isn’t published, but perhaps there are some other observations people did?
Thanks.

When enabling dlcm=ca on Pascal GPUs thread blocks are distributed across a single SM partition (2 warp schedulers) instead of all 4 warps schedulers. This will impact occupancy and scheduling. Caching of L1 data can result in lower memory latency (long scoreboard stalls). The thread block scheduling change is done so that all warps in the thread block communicate with the same unified L1/TEX cache. I suspect the work distribution is resulting in difference in warp scheduling and L1 latency.

Instruction and constant caches access L2 through the GPC L1.5 cache but all local/global/texture/surface access go through the unified L1/TEX cache to the L2 cache. The attribute you mention does not change the policy of the instruction cache, constant cache, or L1.5 cache.

The instruction cache supports automatic prefetch. The compiler will attempt to align loop bodies to cache line boundaries when its heuristic determines this will provide a performance benefit. It is not possible via CUDA C to attempt to align the loop body. Unrolling a loop will change the distribution of stall reasons as the compiler may batch memory operations when unrolling. Single body loops tend to have higher short/long scoreboard stalls and allocation stalls (waiting for branch to resolve).