How do I optimize global memory accesses? The data ends up in registers, but also in local memory.

I've written some code to compute a bitonic sort, and I'm having trouble optimizing it. One optimization is to make some threads do more compare/swaps, in an effort to minimize repeated global memory accesses. The kernel code with that optimization (or so I thought) looks like this:

__constant__ int normalized_compares[160] = {

    0, 16, 1, 17, 2, 18, 3, 19, 4, 20,  5, 21,  6, 22,  7, 23,  8, 24,  9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,

    0,  8, 1,  9, 2, 10, 3, 11, 4, 12,  5, 13,  6, 14,  7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,

    0,  4, 1,  5, 2,  6, 3,  7, 8, 12,  9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,

    0,  2, 1,  3, 4,  6, 5,  7, 8, 10,  9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,

    0,  1, 2,  3, 4,  5, 6,  7, 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void Kernel3(int * a, int length, int phase, int level, int level2)

{

    int ig = threadIdx.x

            + blockDim.x * blockIdx.x

            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y

            + blockDim.x * gridDim.x * threadIdx.y;

    if (ig < 0)

        return;

    if (ig >= length)

        return;

    const int degree = 5;

    const int msize = 32;

    const int csize = 160;

    register int memory[msize];

    for (int i = 0; i < msize; ++i)

        memory[i] = i * level2;

    int threads = memory[1];

    int block_size = memory[1] * msize;

    int base = ig % threads + block_size * (int)(ig / threads);

    int mm[msize]; // can't seem to do register int mm[msize];

    for (int i = 0; i < msize; ++i)

    {

        mm[i] = a[base + memory[i]];

    }

    for (int i = 0; i < csize; i += 2)

    {

        int cross = normalized_compares[i];

        int paired = normalized_compares[i+1];

        swap(mm[cross], mm[paired]);

    }

    for (int i = 0; i < msize; ++i)

    {

        a[base + memory[i]] = mm[i];

    }

}

This code is supposed to load values from global memory into the automatic array mm, compare and swap the values in mm, then store them back to global memory. At least, that's what I'd like it to do.

Unfortunately, the integer array mm ends up in local memory in the PTX (use --keep to see the PTX). Local memory is essentially per-thread global memory, so there is no benefit to fetching the data into mm.
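
As an aside, a quick way to confirm the spill without reading the PTX is to ask ptxas for its resource usage report by adding -Xptxas -v (or --ptxas-options=-v) to the nvcc command line, e.g.

nvcc -Xptxas -v -c cuda.cu

ptxas then prints the number of registers and the bytes of local memory (lmem) each kernel uses.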

The really sad thing is that the entire array mm is also held in registers. Looking at the PTX, there are around 260 registers, with a unique register used for each element of the array. For example, register %r215, which receives the last element loaded into mm, is never used again after its value is stored to local memory.

add.s32        %r212, %r78, %r85;

mul.lo.u32     %r213, %r212, 4;

add.u32        %r214, %r213, %r89;

ld.global.s32  %r215, [%r214+0];

st.local.s32   [__cuda___cuda_local_var_144493_6_non_const_mm_12844+124], %r215;

According to the NVIDIA CUDA C Programming Guide:

"Automatic variables that the compiler is likely to place in local memory are:

  • Arrays for which it cannot determine that they are indexed with constant quantities,

  • Large structures or arrays that would consume too much register space,

  • Any variable if the kernel uses more registers than available (this is also known as register spilling)."

I'm not sure what to do to force the compiler to keep the array entirely in registers. I could start, I suppose, by hardwiring the loop that contains the swaps with constant indices, e.g.,

swap(mm[0], mm[16]);

swap(mm[1], mm[17]);

Anyone have any ideas?

Ken

You need to unroll the loops (by preceding them with #pragma unroll).

The register keyword doesn't have any effect in modern compilers anymore. And don't pay too much attention to the registers in the PTX code, because the real register allocation happens at a later stage (during ptxas "assembly").
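
A minimal sketch of what that looks like for the compare/swap loop in the kernel above (only the pragma line is new; everything else is unchanged):

    #pragma unroll
    for (int i = 0; i < csize; i += 2)
    {
        int cross = normalized_compares[i];
        int paired = normalized_compares[i+1];
        swap(mm[cross], mm[paired]);
    }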

"#pragma unroll" didn't work. Instead, the compiler printed the warning "Advisory: Loop was not unrolled, unexpected control flow construct". Unrolling the middle for-loop manually did work, and local storage is now avoided (see the code below). "#pragma unroll" had no effect on the first and third for-loops, i.e., the PTX was exactly the same with or without the pragma; those two loops are already unrolled by the compiler, possibly because I have compiler optimizations turned on.

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\"

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" --use-local-env --cl-version 2010 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin"     --keep --keep-dir "Release" -maxrregcount=0  --machine 32 --compile      -Xcompiler "/EHsc /nologo /O2 /Zi  /MD " -o "Release\cuda.cu.obj" "C:\Users\Ken\Documents\Visual Studio 2010\Projects\Bitonic-CUDA\cuda.cu"

I’m not sure what the compiler is complaining about. It should be able to unroll this example because all the indices are constants. I solved my problem, but it would be great if the pragma would work.

__device__ __host__ inline void swap(int & a, int & b)

{

    if (a < b)

        return;

    int tmp = a;

    a = b;

    b = tmp;

}

__constant__ int normalized_compares[160] = {

    0, 16, 1, 17, 2, 18, 3, 19, 4, 20,  5, 21,  6, 22,  7, 23,  8, 24,  9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,

    0,  8, 1,  9, 2, 10, 3, 11, 4, 12,  5, 13,  6, 14,  7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,

    0,  4, 1,  5, 2,  6, 3,  7, 8, 12,  9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,

    0,  2, 1,  3, 4,  6, 5,  7, 8, 10,  9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,

    0,  1, 2,  3, 4,  5, 6,  7, 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void Kernel3(int * a, int length, int phase, int level, int level2)

{

    int ig = threadIdx.x

            + blockDim.x * blockIdx.x

            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y

            + blockDim.x * gridDim.x * threadIdx.y;

    if (ig < 0)

        return;

    if (ig >= length)

        return;

    const int degree = 5;

    const int msize = 32;

    const int csize = 160;

    int memory[msize];

    for (int i = 0; i < msize; ++i)

        memory[i] = i * level2;

    int threads = memory[1];

    int block_size = memory[1] * msize;

    int base = ig % threads + block_size * (int)(ig / threads);

    int mm[msize];

    #pragma unroll

    for (int i = 0; i < msize; ++i)

    {

        mm[i] = a[base + memory[i]];

    }

#define XXX

#ifdef XXX

    #pragma unroll

    for (int i = 0; i < csize; i += 2)

    {

        int cross = normalized_compares[i];

        int paired = normalized_compares[i+1];

        swap(mm[cross], mm[paired]);

    }

#else

    swap(mm[0], mm[16]);

    swap(mm[1], mm[17]);

    swap(mm[2], mm[18]);

    swap(mm[3], mm[19]);

    swap(mm[4], mm[20]);

    swap(mm[5], mm[21]);

    swap(mm[6], mm[22]);

    swap(mm[7], mm[23]);

    swap(mm[8], mm[24]);

    swap(mm[9], mm[25]);

    swap(mm[10], mm[26]);

    swap(mm[11], mm[27]);

    swap(mm[12], mm[28]);

    swap(mm[13], mm[29]);

    swap(mm[14], mm[30]);

    swap(mm[15], mm[31]);

    swap(mm[0], mm[8]);

    swap(mm[1], mm[9]);

    swap(mm[2], mm[10]);

    swap(mm[3], mm[11]);

    swap(mm[4], mm[12]);

    swap(mm[5], mm[13]);

    swap(mm[6], mm[14]);

    swap(mm[7], mm[15]);

    swap(mm[16], mm[24]);

    swap(mm[17], mm[25]);

    swap(mm[18], mm[26]);

    swap(mm[19], mm[27]);

    swap(mm[20], mm[28]);

    swap(mm[21], mm[29]);

    swap(mm[22], mm[30]);

    swap(mm[23], mm[31]);

    swap(mm[0], mm[4]);

    swap(mm[1], mm[5]);

    swap(mm[2], mm[6]);

    swap(mm[3], mm[7]);

    swap(mm[8], mm[12]);

    swap(mm[9], mm[13]);

    swap(mm[10], mm[14]);

    swap(mm[11], mm[15]);

    swap(mm[16], mm[20]);

    swap(mm[17], mm[21]);

    swap(mm[18], mm[22]);

    swap(mm[19], mm[23]);

    swap(mm[24], mm[28]);

    swap(mm[25], mm[29]);

    swap(mm[26], mm[30]);

    swap(mm[27], mm[31]);

    swap(mm[0], mm[2]);

    swap(mm[1], mm[3]);

    swap(mm[4], mm[6]);

    swap(mm[5], mm[7]);

    swap(mm[8], mm[10]);

    swap(mm[9], mm[11]);

    swap(mm[12], mm[14]);

    swap(mm[13], mm[15]);

    swap(mm[16], mm[18]);

    swap(mm[17], mm[19]);

    swap(mm[20], mm[22]);

    swap(mm[21], mm[23]);

    swap(mm[24], mm[26]);

    swap(mm[25], mm[27]);

    swap(mm[28], mm[30]);

    swap(mm[29], mm[31]);

    swap(mm[0], mm[1]);

    swap(mm[2], mm[3]);

    swap(mm[4], mm[5]);

    swap(mm[6], mm[7]);

    swap(mm[8], mm[9]);

    swap(mm[10], mm[11]);

    swap(mm[12], mm[13]);

    swap(mm[14], mm[15]);

    swap(mm[16], mm[17]);

    swap(mm[18], mm[19]);

    swap(mm[20], mm[21]);

    swap(mm[22], mm[23]);

    swap(mm[24], mm[25]);

    swap(mm[26], mm[27]);

    swap(mm[28], mm[29]);

    swap(mm[30], mm[31]);

#endif

    #pragma unroll

    for (int i = 0; i < msize; ++i)

    {

        a[base + memory[i]] = mm[i];

    }

}

I just realized that, for the elimination of local memory to even be correct, normalized_compares needs to be declared const int (in addition to __constant__). But even with that, I couldn't get the compiler to eliminate it either.
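
In other words, the declaration would become something like this (the 160 initializer values stay exactly as listed above; only the added const is new):

__constant__ const int normalized_compares[160] = { /* same 160 values as above */ };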

How is swap() defined?

swap() contains an "if (a < b) return;" statement. (I thought that might matter, so I posted swap() in my reply; see the code above.) If the if-return is commented out, the compiler does unroll the loop, and the PTX becomes one big, fat basic block. Good! That more or less explains why it doesn't unroll the loop with the if-return in place. Unfortunately, it still uses local memory, as though something is throwing off constant propagation. Even with your suggestion to declare normalized_compares const (as well as __constant__), mm is still placed in local memory.

I've also noticed a few times that the inliner really doesn't like return statements (even when they are completely harmless), so it might be better to rewrite swap() as

__device__ __host__ inline void swap(int & a, int & b)

{

    if (a > b) {

        int tmp = a;

        a = b;

        b = tmp;

    }

}

or

__device__ __host__ inline void swap(int & a, int & b)

{

    int tmp_min = min(a, b);

    int tmp_max = max(a, b);

    a = tmp_min;

    b = tmp_max;

}

Yep, the rewrite does seem to help the compiler find its way to unrolling the loop. It looks like there may be a problem with #pragma unroll and inlined functions. I also tried removing tmp from swap() by using an exclusive-or swap, but that doesn't stop the compiler from using local storage.
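
For reference, the exclusive-or version was along these lines (it only eliminates the temporary; note it assumes a and b never refer to the same element, which holds for these compare pairs):

__device__ __host__ inline void swap(int & a, int & b)
{
    if (a > b)       // swap only when out of order
    {
        a ^= b;      // exchange without a temporary
        b ^= a;
        a ^= b;
    }
}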

Back on normalized_compares: I tried replacing __constant__ with __device__ for normalized_compares, and also tried making it an automatic array inside Kernel3. In both cases the compiler still doesn't keep mm purely in registers. In fact, making normalized_compares automatic (scoped within the kernel) makes the compiler emit pairs of PTX mov instructions to initialize a register copy of normalized_compares. (That's why I moved normalized_compares to __constant__ space in the first place.) And the code it generates looks pretty sad: it doubles the number of registers used, with an extra mov to initialize the extra copy. See below:

mov.s32 %r1, 0;

mov.s32 %r2, %r1;

mov.s32 %r3, 16;

mov.s32 %r4, %r3;

mov.s32 %r5, 1;

mov.s32 %r6, %r5;

mov.s32 %r7, 17;

mov.s32 %r8, %r7;

mov.s32 %r9, 2;

mov.s32 %r10, %r9;

Right, I think the compiler has multiple problems here, including constant propagation from an array.

Yep, I had also tried that and still couldn’t eliminate local storage.

Again, this can't be judged from the generated PTX code, as register allocation only takes place at the ptxas stage. So it is necessary to compile all the way to a .cubin file and use cuobjdump -sass on it.
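
Something along these lines should do it (the architecture flag is just an example, adjust it to whatever you are actually building for):

nvcc -cubin -arch=sm_20 -o cuda.cubin cuda.cu
cuobjdump -sass cuda.cubin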

Based on what I understand from looking at similar unrolling issues, the “return” inside the swap() function is indeed what inhibits the unroller. The unroller likes nice single-entry single-exit control flows, such as loops and if-then-else. A return statement is a fancy way of saying “goto”, the bane of all optimizers.

Since “break” and “continue” inside loops are also hidden instances of “goto”, I suspect, but have not verified, that those likewise cause the unroller to give up. At least the compiler is nice enough to let programmers know why it cannot unroll a loop. Not all compilers do that.
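
For what it's worth, rewriting the compare/swap loop with a continue instead of a call to swap() would be an easy way to check that; I would expect it to trigger the same advisory (untested sketch, same exchange logic as swap() written inline):

    #pragma unroll
    for (int i = 0; i < csize; i += 2)
    {
        int cross = normalized_compares[i];
        int paired = normalized_compares[i+1];
        if (mm[cross] < mm[paired])
            continue;              // already ordered; a hidden goto, just like the early return
        int tmp = mm[cross];
        mm[cross] = mm[paired];
        mm[paired] = tmp;
    }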

As tera states, register usage cannot be judged by looking at the PTX. The compiler generates PTX in SSA form, which basically means that each new result written is assigned to a new register. For a description of SSA (static single assignment), see the Wikipedia article on static single-assignment form.
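
For example, two consecutive source-level updates such as x = x + 1; x = x + 2; show up in the PTX roughly as

add.s32 %r2, %r1, 1;
add.s32 %r3, %r2, 2;

with a fresh virtual register for each result. Only later does ptxas map these virtual registers onto the real hardware register file, reusing registers where it can.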