Optimization, high register usage with templates how to best go about utilizing registers ?


I’ve been trying to optimize some of my kernel code further by enabling more and more data to be fit on-chip for intensive computations. After having maxed out my shared memory usage i am now trying to use more registers to be able to fit more data on-chip to achieve furhter speedup.

This seems to be a good strategy when you dont to many threads per block. “Too much thread parallelism hurts” - V. Volkov

So, what is the best way to go about it when wanting to use a lot of registers without doing too much manual “hard-coding”? I didn’t find too many references about it so I did some testing on my own…

Some of my test code looks as follows:

template<unsigned int x_dim, unsigned int y_dim>

__global__ static void registerUsageKernel(float* output, float* input)


	__shared__ float B_matrix[x_dim][y_dim];

	float columnSum = 0;

// registers...

	float A_column[y_dim];

#pragma unroll

	for(int i = 0; i < y_dim; i++)

	{	// just use some input values to get some work done

		B_matrix[i][threadIdx.x] = input[i*x_dim + threadIdx.x]*2;

		A_column[i] = input[i*x_dim + threadIdx.x];





	// Do multiplikation and Add over each column

#pragma unroll

	for(int i = 0; i < y_dim; i++)


		columnSum += A_column[i]*B_matrix[i][threadIdx.x];



	// put out the answer;

	output[threadIdx.x] = columnSum;


What I’m wondering is if there is a smoother way to go about this? Do you have any suggestions ?



I’m honestly quite curious to know people are using the registers in a similar manner OR if it cauese some issues that I haven’t been able to forsee…

Does it for example have an effect on register dependency latencies ?

I’ve done very similar things - your case looks relatively simple compared to some of the messes I’ve had with register arrays.

I don’t think you need to worry about the latencies if you have over 192 threads per MP.

Ok, cool, the code above isn’t my actual problem. In my real problem I’m doing similar things but there are MANY loop layers and I’m afraid the unrolling isn’t going to work anymore ( which leads the compiler to put to everything ending in local memory instead… ).

But if many people are doing it in a similar way then i should at least give it a shot.


Yeah - one of my really problems involved lots and lots of manual unrolling. I seem to remember I had texture reads, and nvcc doesn’t like to unroll texture reads…

Wow, got it working after some serious tinkering with all my loops. Now able to fit 2 64 by 62 matrices on-chip :D

Got at 40% performance bump, estimated at 155.9 GFLOPS :)

I’ve continued coding more and more into registers until i was finally able to dump all of my matrices into registers (getting everything to unroll nicely was tough!). Thus far I’ve double my speed compared to using shared memory. Just for kicks I tried doing some heavy trigonomerty in the inner parts of my loops and I’m now getting well over 300 GLFOPs ( I know, it’s a tacky measuring metric).

My conclusion is this: If you’re problem is well defined you should avoid using shared memory like it’s the plague. Less shared memory, fewer threads and more registers.

This seems to work well for some problem and I’m sure there are a lot of people who don’t agree with this programming philosophy.

It’s not about philosopy but how good loop unrolling are…
I had some troubles in the past with that concept. I wanted to implement a sorting algorithm for 32-element arrays. Simple thing that can be performed by a single warp without any __syncthreads…
I just took the bitonic sort from the SDK and plugged in a constant value of 32 and removed some useless stuff because of that.
It worked fine but wanted to try something faster. Adding #pragma unroll did not help at all.
However when I manually unrolled both loops which are in there, got a serious performance increase - something like 2 times faster, if I recall correctly!

Pre 2.3 if #pragma unroll didn’t work it wouldn’t give any warning. Often loops don’t unroll with #pragma unroll even though it looks like they should. This may be why manual unrolling helped you.

By the way, unless y_dim is totally static and known at compile time, I don’t see how one could hope to unroll those loops.

It is indeed static and known at compile time as it’s template variable.

Oops, didn’t notice the first line.

I should get some sleep I guess :)

I guess the main reason for the performance gain here was that i was able fit more data on-chip.

Assuming that I manage to avoid bank conflicts, do you guys think that the shared memory is as fast as registers? ( like the PG claims)

While trying these on some old apps i did however run into some trouble when having shared memory in the loop I was unrolling.

If I unroll more than 63 iterations I will get:

nvopencc ERROR: C:\CUDA\bin/…/open64/lib//be.exe returned non-zero status -1073741819

I tried allocating separate shared memory portions and split the loop into several parts, this didn’t help however.

Any suggestions?

Ok, turns out it didn’t matter anyways since the global minimum for computational time was achieved with a mere 24 registers per thread. This must be because i could then achieve much higher occupancy. Got perfomance up somewhere about 300%, with another tacky FLOP/s ending at around 240 GFLOPs for my problem. This is on a GTX 260.

Ah, sorry for spamming these boards!

240 GFLOPs is not bad! Most I’ve ever seen out of a single card is ~110 GFLOPS.

Yes, in sweden we call it “Gigafläsk” ;)