Hi,
I’m looking for some information about how the instruction cache works. My GPU is a GTX 1070 Ti (Pascal, sm_61), and the code is compiled with CUDA 10.1.
Here’s my observation:
I’ve created a kernel that is essentially a loop with a long body; it performs no global reads.
#include <cstdint> // for uint32_t

// 64-bit rotate right built from funnel shifts. shf.r.wrap takes the shift
// count mod 32, and the swapped halves below make this the variant for
// offsets in [32, 63] (it's only called with offset = 63).
__device__ __forceinline__
uint2 ROTATE(const uint2 a, const uint32_t offset)
{
    uint2 result;
    asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.x) : "r"(a.y), "r"(a.x), "r"(offset));
    asm("shf.r.wrap.b32 %0, %1, %2, %3;" : "=r"(result.y) : "r"(a.x), "r"(a.y), "r"(offset));
    return result;
}

// 64-bit rotate right by 24 via byte permutes (compiles to two PRMTs).
__device__ __forceinline__
uint2 ROTATE24(const uint2 a)
{
    uint2 result;
    result.x = __byte_perm(a.x, a.y, 0x6543);
    result.y = __byte_perm(a.y, a.x, 0x6543);
    return result;
}

// 64-bit rotate right by 16 via byte permutes.
__device__ __forceinline__
uint2 ROTATE16(const uint2 a)
{
    uint2 result;
    result.x = __byte_perm(a.x, a.y, 0x5432);
    result.y = __byte_perm(a.y, a.x, 0x5432);
    return result;
}

// uint2 has no built-in operators; define the xor-assign used in the kernel
__device__ __forceinline__
uint2& operator^=(uint2& a, const uint2& b)
{
    a.x ^= b.x; a.y ^= b.y;
    return a;
}
__global__
__launch_bounds__(256, 2)
void my_kernel(uint32_t* out)
{
    uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    uint2 data[4];
    data[0] = make_uint2(thread, thread);
    data[1] = make_uint2(thread + 5, thread + 5);
    data[2] = make_uint2(thread + 15, thread + 15);
    data[3] = make_uint2(thread + 20, thread + 20);
    #pragma unroll 1 // keep the outer loop rolled
    for (int t = 0; t < 64; t++)
    {
        #pragma unroll // fully unroll: 400 copies of the body
        for (int i = 0; i < 400; i++)
        {
            data[0] = ROTATE(data[0], 63);
            data[1] = ROTATE24(data[1]);
            data[2] = ROTATE16(data[2]);
            data[3] ^= data[2];
        }
    }
    // parentheses matter here: ^ binds looser than ==, so without them the
    // comparison would apply to data[3].x / data[3].y alone
    if ((data[0].x ^ data[1].x ^ data[2].x ^ data[3].x) == 0 &&
        (data[0].y ^ data[1].y ^ data[2].y ^ data[3].y) == 0)
        out[0] = thread;
}
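For completeness, the host side is nothing special; a minimal sketch of how it can be launched (the grid size is my placeholder, sized for the 2 blocks per SM requested by __launch_bounds__ on the 1070 Ti's 19 SMs):

int main()
{
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(uint32_t));
    my_kernel<<<38, 256>>>(d_out); // 19 SMs * 2 blocks per SM
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}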
When I run it under nvprof, it shows 13% instruction fetch stalls. However, when I compile the module with the “-Xptxas -dlcm=ca” flag, the instruction fetch stalls drop to 5.7% and I see a performance improvement of ~4%.
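Roughly the commands involved (file and binary names are placeholders; sm_61 is the 1070 Ti's architecture):

nvcc -arch=sm_61 -cubin my_kernel.cu -o default.cubin
nvcc -arch=sm_61 -Xptxas -dlcm=ca -cubin my_kernel.cu -o dlcm.cubin
nvprof --metrics stall_inst_fetch ./my_app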
I inspected the cuobjdump output of both versions and there is no difference in the SASS code at all, but the cubin produced with the dlcm flag contains an additional attribute:
Attribute: EIATTR_EXPLICIT_CACHING
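This is the comparison, in case anyone wants to check (again, file names are mine):

cuobjdump -sass default.cubin > default.sass
cuobjdump -sass dlcm.cubin > dlcm.sass   # diff shows the two are identical
cuobjdump -elf dlcm.cubin                # the ELF dump shows the extra attribute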
The SASS code size is ~31KB, which is about what you'd expect from the unrolled inner loop: 8 instructions per iteration (2 SHF + 4 PRMT + 2 XOR) × 400 iterations ≈ 3,200 instructions at 8 bytes each, plus Pascal's control words. Assuming the L1 instruction cache on Pascal is 8KB and the L1.5 is 32KB (I read this in one of the “Dissecting GPU via microbenchmarking” papers), my theory is this: when I launch the kernel without the EIATTR_EXPLICIT_CACHING attribute, the code gets cached in the L1.5 cache and never goes into L1, whereas EIATTR_EXPLICIT_CACHING forces L1 caching for the part of the code that fits into it. If someone could shed some light on this, I would be very grateful.
Another thing I noticed: sometimes increasing the length of the loop body actually reduces the number of instruction fetch stalls and gives a performance gain, which contradicts the general rule of “longer code → more instruction fetch stalls”. Do I need to “align” the loop body to some magic number of instructions to bring the stalls to a minimum? (A sketch of how I vary the body length follows below.) I know the information I’m looking for isn’t published, but perhaps there are some other observations people have made?
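For anyone who wants to reproduce the body-length effect, this is roughly what I mean by varying it (a sketch; the template parameter and the instantiated values are placeholders of mine):

template <int INNER>
__global__
__launch_bounds__(256, 2)
void my_kernel_sweep(uint32_t* out)
{
    uint32_t thread = blockDim.x * blockIdx.x + threadIdx.x;
    uint2 data[4];
    data[0] = make_uint2(thread, thread);
    data[1] = make_uint2(thread + 5, thread + 5);
    data[2] = make_uint2(thread + 15, thread + 15);
    data[3] = make_uint2(thread + 20, thread + 20);
    #pragma unroll 1
    for (int t = 0; t < 64; t++)
    {
        #pragma unroll // body length scales linearly with INNER
        for (int i = 0; i < INNER; i++)
        {
            data[0] = ROTATE(data[0], 63);
            data[1] = ROTATE24(data[1]);
            data[2] = ROTATE16(data[2]);
            data[3] ^= data[2];
        }
    }
    if ((data[0].x ^ data[1].x ^ data[2].x ^ data[3].x) == 0 &&
        (data[0].y ^ data[1].y ^ data[2].y ^ data[3].y) == 0)
        out[0] = thread;
}

Each instantiation (e.g. my_kernel_sweep<350><<<38, 256>>>(d_out);) can then be profiled separately with the stall_inst_fetch metric.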
Thanks.