I’ve written some code to compute bitonic sort, and I’m having a problem optimizing it. One optimization is to make some threads do more compare/swaps in an effort to try to minimize multiple global memory accesses. The kernel code with that optimization (I thought) looks like this:
__constant__ int normalized_compares[160] = {
0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23, 8, 24, 9, 25, 10, 26, 11, 27, 12, 28, 13, 29, 14, 30, 15, 31,
0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15, 16, 24, 17, 25, 18, 26, 19, 27, 20, 28, 21, 29, 22, 30, 23, 31,
0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15, 16, 20, 17, 21, 18, 22, 19, 23, 24, 28, 25, 29, 26, 30, 27, 31,
0, 2, 1, 3, 4, 6, 5, 7, 8, 10, 9, 11, 12, 14, 13, 15, 16, 18, 17, 19, 20, 22, 21, 23, 24, 26, 25, 27, 28, 30, 29, 31,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};
__global__ void Kernel3(int * a, int length, int phase, int level, int level2)
{
int ig = threadIdx.x
+ blockDim.x * blockIdx.x
+ blockDim.x * gridDim.x * blockDim.y * blockIdx.y
+ blockDim.x * gridDim.x * threadIdx.y;
if (ig < 0)
return;
if (ig >= length)
return;
const int degree = 5;
const int msize = 32;
const int csize = 160;
register int memory[msize];
for (int i = 0; i < msize; ++i)
memory[i] = i * level2;
int threads = memory[1];
int block_size = memory[1] * msize;
int base = ig % threads + block_size * (int)(ig / threads);
int mm[msize]; // can't seem to do register int mm[msize];
for (int i = 0; i < msize; ++i)
{
mm[i] = a[base + memory[i]];
}
for (int i = 0; i < csize; i += 2)
{
int cross = normalized_compares[i];
int paired = normalized_compares[i+1];
swap(mm[cross], mm[paired]);
}
for (int i = 0; i < msize; ++i)
{
a[base + memory[i]] = mm[i];
}
}
This code is supposed to load global memory into automatic array mm, compare and swap the values in mm, then store the values back to global memory. At least, that’s what I’d like it to do.
Unfortunately, the integer array mm is represented in PTX as local memory (use --keep to see the PTX). Local memory is essentially per-thread global memory, so there is no benefit of fetching the data into mm.
The really sad thing is that the entire array mm is actually also contained in registers. Looking at the PTX, there are around 260 registers, and there is a unique register used for each element in the array. For example, register %r215, used in the calculation of the last element of mm, is never used after storing the value into local memory.
add.s32 %r212, %r78, %r85;
mul.lo.u32 %r213, %r212, 4;
add.u32 %r214, %r213, %r89;
ld.global.s32 %r215, [%r214+0];
st.local.s32 [__cuda___cuda_local_var_144493_6_non_const_mm_12844+124], %r215;
According to the NVIDIA CUDA C Programming Guide:
"Automatic variables that the compiler is likely to place in local memory are:
-
Arrays for which it cannot determine that they are indexed with constant quantities,
-
Large structures or arrays that would consume too much register space,
-
Any variable if the kernel uses more registers than available (this is also known as register spilling)."
I’m not sure what to do to force the compiler to just represent the array in registers. I could start, I suppose, by hardwiring the loop that contains the swaps with constant quantities, e.g.,
swap(mm[0], mm[16]);
swap(mm[1], mm[17]);
…
Anyone have any ideas?
Ken