I’ve been trying to optimize some of my kernel code by fitting more and more data on-chip for intensive computations. Having maxed out my shared memory usage, I am now trying to use more registers so I can fit additional data on-chip and achieve further speedup.
This seems to be a good strategy when you don’t have too many threads per block. “Too much thread parallelism hurts” - V. Volkov
So, what is the best way to go about using a lot of registers without doing too much manual “hard-coding”? I didn’t find many references on the subject, so I did some testing on my own…
Some of my test code looks as follows:
template<unsigned int x_dim, unsigned int y_dim>
__global__ static void registerUsageKernel(float* output, float* input)
{
    // Note: the first dimension must be y_dim so that B_matrix[i][threadIdx.x],
    // with i < y_dim and threadIdx.x < x_dim, stays in bounds.
    __shared__ float B_matrix[y_dim][x_dim];
    float columnSum = 0.0f;

    // per-thread array; with the loop fully unrolled the compiler
    // can keep it in registers
    float A_column[y_dim];

    #pragma unroll
    for (int i = 0; i < y_dim; i++)
    {   // just use some input values to get some work done
        B_matrix[i][threadIdx.x] = input[i * x_dim + threadIdx.x] * 2;
        A_column[i] = input[i * x_dim + threadIdx.x];
    }
    __syncthreads();

    // multiply and accumulate down each column
    #pragma unroll
    for (int i = 0; i < y_dim; i++)
    {
        columnSum += A_column[i] * B_matrix[i][threadIdx.x];
    }
    __syncthreads();

    // write out the answer
    output[threadIdx.x] = columnSum;
}
What I’m wondering is whether there is a smoother way to go about this. Do you have any suggestions?
Ok, cool, but the code above isn’t my actual problem. In my real problem I’m doing similar things, but there are MANY loop layers and I’m afraid the unrolling isn’t going to work anymore (which leads the compiler to spill everything to local memory instead…).
But if many people are doing it in a similar way, then I should at least give it a shot.
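To illustrate what I mean, here is a hedged sketch (a toy kernel I made up, not my real code) of the pattern with one extra loop layer. The point is that every bound is a template parameter, so after full unrolling every index into the per-thread array is a compile-time constant and the array can live in registers; if any bound were a runtime value, the compiler could not unroll and the array would be demoted to local memory:

```cuda
// Sketch only: nested loops with compile-time bounds so A_tile can
// stay in registers. ROWS/COLS and the kernel name are made up.
template<unsigned int ROWS, unsigned int COLS>
__global__ void nestedUnrollKernel(float* output, const float* input)
{
    float A_tile[ROWS];  // intended to end up in registers

    #pragma unroll
    for (int i = 0; i < ROWS; i++)
        A_tile[i] = input[i * COLS + threadIdx.x];

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < ROWS; i++)
    {
        #pragma unroll
        for (int j = 0; j < COLS; j++)
        {
            // With both bounds known at compile time, both loops can
            // fully unroll and A_tile[i] becomes a fixed register.
            acc += A_tile[i] * input[j * COLS + threadIdx.x];
        }
    }
    output[threadIdx.x] = acc;
}
```

Checking the ptxas output (compile with -Xptxas -v) should tell you whether local memory (lmem) is actually being used.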
Yeah - one of my real problems involved lots and lots of manual unrolling. I seem to remember I had texture reads, and nvcc doesn’t like to unroll loops containing texture reads…
I’ve continued moving more and more into registers until I was finally able to dump all of my matrices into registers (getting everything to unroll nicely was tough!). So far I’ve doubled my speed compared to using shared memory. Just for kicks I tried doing some heavy trigonometry in the inner parts of my loops, and I’m now getting well over 300 GFLOP/s (I know, it’s a tacky metric).
My conclusion is this: if your problem is well defined, you should avoid shared memory like the plague. Less shared memory, fewer threads and more registers.
This seems to work well for some problems, though I’m sure there are a lot of people who don’t agree with this programming philosophy.
It’s not about philosophy but about how good the loop unrolling is…
I had some trouble with that concept in the past. I wanted to implement a sorting algorithm for 32-element arrays - a simple thing that can be performed by a single warp without any __syncthreads…
I just took the bitonic sort from the SDK, plugged in a constant value of 32 and removed some code that became unnecessary as a result.
It worked fine, but I wanted to try something faster. Adding #pragma unroll did not help at all.
However, when I manually unrolled both of the loops in there, I got a serious performance increase - something like 2 times faster, if I recall correctly!
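For anyone unsure what “manually unrolled” means here, a minimal sketch (a toy 4-element sum, not the actual bitonic sort loops) of the difference:

```cuda
// Toy example of manual unrolling; sum4_* are made-up names.

// Loop version - the compiler may or may not honor the pragma:
__device__ float sum4_loop(const float* v)
{
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; i++)
        s += v[i];
    return s;
}

// Manually unrolled version - every iteration written out by hand,
// so there is no loop left for the compiler to get wrong:
__device__ float sum4_unrolled(const float* v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Tedious for anything bigger than a toy, but it guarantees the unroll actually happens.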
Pre-2.3, if #pragma unroll didn’t work, nvcc wouldn’t give any warning. Loops often fail to unroll under #pragma unroll even though it looks like they should; that may be why manual unrolling helped you.
Ok, it turns out it didn’t matter anyway, since the global minimum in computation time was achieved with a mere 24 registers per thread. This must be because I could then achieve much higher occupancy. That got performance up by about 300%, with another tacky FLOP/s figure ending at around 240 GFLOP/s for my problem. This is on a GTX 260.
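In case it helps anyone trying the same trade-off: there are two ways (as I understand them) to cap register usage and let occupancy rise. The 24-register figure is just the sweet spot I found for my problem, and the kernel below is a made-up placeholder:

```cuda
// 1) Per-kernel: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
//    lets the compiler budget registers so that the requested number
//    of blocks can be resident per multiprocessor.
__global__ void __launch_bounds__(256, 4) cappedKernel(float* out)
{
    out[threadIdx.x] = threadIdx.x * 2.0f;  // placeholder work
}

// 2) Whole-compilation: pass -maxrregcount=24 to nvcc to cap every
//    kernel in the file at 24 registers per thread. Anything the
//    kernel can't fit in 24 registers gets spilled to local memory,
//    so check the ptxas output (-Xptxas -v) to see the actual cost.
```

The per-kernel form is the finer-grained of the two, since -maxrregcount applies to the whole compilation unit.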