Weird register usage: bad allocation with repetitive code

Hey there,

I noticed something that I can’t really explain and I’d like your input on this.

The background first. I’ve been working on a finite element solver in which each element has 8 nodes. There are two passes, one over the elements and one over the nodes; we are interested here in the first pass. For each element, I have to compute some nodal quantities (one for each of the 8 nodes). To do that, I need quite a lot of data per element, and the data differ between elements of course. Therefore I have to make heaps of texture fetches, store the results in registers, do something with them, and write the result for each node to an array stored in global memory (so it can be re-used in the subsequent kernel). I’m explaining this so you won’t be surprised by the number of registers I use; it’s not like I have a choice. I was able to use shared memory as extra registers to some extent (since nothing needs to be shared between threads), but there isn’t enough shared memory to use this technique everywhere.
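To give an idea of the structure, here is a heavily simplified sketch of the element pass (the texture name, the data layout and the arithmetic are made up for illustration; this is not my actual code):

        texture<float4, 1, cudaReadModeElementType> texElemData;   // per-element data bound to a 1D texture

        __global__ void elementPass(float4 *F0 /*, ..., float4 *F7 */, int nbElements)
        {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;        // one thread per element
            if (tid >= nbElements) return;

            // heaps of texture fetches: the per-element data ends up in registers
            float4 d0 = tex1Dfetch(texElemData, 16 * tid + 0);
            float4 d1 = tex1Dfetch(texElemData, 16 * tid + 1);
            // ... many more fetches ...

            // do something with it, then write one result per node to global memory
            // so that the second (nodal) kernel can re-use it
            F0[tid] = make_float4(d0.x + d1.x, d0.y + d1.y, d0.z + d1.z, 0.0f);
            // ... F1[tid] to F7[tid] likewise ...
        }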

Anyway, after a lot of tweaks, the best memory usage I could obtain was this:

lmem = 0

smem = 468

reg = 88

Then I decided to gather the calculation for each node into a separate function and to call this function 8 times (instead of repeating the code in the kernel). I figured it might help the compiler save registers. So I have code that looks like this:

        F0[tid] = computeElementForce(0, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F1[tid] = computeElementForce(1, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F2[tid] = computeElementForce(2, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F3[tid] = computeElementForce(3, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F4[tid] = computeElementForce(4, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F5[tid] = computeElementForce(5, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F6[tid] = computeElementForce(6, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F7[tid] = computeElementForce(7, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);

There are a lot of arguments, I know; most of them are float4. This uses heaps of registers, but re-doing all the texture fetches each time is way too costly, so that’s the best I can do. The arrays Fi (i from 0 to 7) are in global memory and hold my final results.

I was able to save 4 registers. It’s not nothing, but it’s not much. Now the funny thing: I tried commenting out some of the node calculations to see how the register count evolved. Here is what I found:

1 node 0/62

2 nodes 0/76

3 nodes 0/92

4 nodes 0/110

5 nodes 80/121

6 nodes 4/78

7 nodes 4/81

8 nodes 0/84

where x/y means x bytes of local memory and y registers. So with one call to this function, 62 registers are used in my kernel; with 2 calls, 76 registers, and so on.

Questions:

  • how do you explain this evolution in the number of registers (and local memory)?

  • why doesn’t the number of registers stay at 62? Shouldn’t the compiler re-use the registers between each call?

Predicting register usage is not trivial, since nvcc inlines function calls and performs aggressive code optimisation. It has been reported that unrolling loops sometimes increases register usage, which seems to be your case.

The compiler tries to reuse registers, but it also tries to optimise the code and avoid register bank conflicts etc., so this may be a reason for what you see.

There are 4 things worth trying:

  1. Use the -maxrregcount switch to limit the number of registers. Always check performance, since spilling data to local memory may considerably slow down your kernel.

  2. Place your calls in a loop with #pragma unroll 1. This will require minor changes to work with the Fi arrays, but will probably save some registers (see the sketch after this list).

  3. Use a construct like

if( threadIdx.x < 0 ) __syncthreads();

somewhere in your code (maybe between function calls). This trick has been posted on this forum and is known to reduce register usage.

  4. A similar trick which I’ve been using in my kernels: declare a constant variable, say g_nTrue, and enclose each of your function calls in a conditional statement:

if( g_nTrue ) F0[tid] = ...;

Then set g_nTrue to 1 (with cudaMemcpyToSymbol); this is also shown in the sketch below.

Thanks for your answer. I tried your 4 suggestions. For comparison, the best I had obtained so far was 0.32ms with the following memory usage:

lmem = 0

reg = 84

Note I’m using 64 threads per block.

Now the results for your suggestions.

  1. I had already tried --maxrregcount. Here are the results:

--maxrregcount 64

lmem = 324

reg = 64

time = 0.40ms with 64 threads/block

time = 0.40ms with 128 threads/block

--maxrregcount 42

lmem = 468

reg = 41

time = 0.74ms with 64 threads/block

time = 0.61ms with 192 threads/block

--maxrregcount 21

lmem = 324

reg = 23

time = 0.90ms with 64 threads/block

time = 0.92ms with 320 threads/block

  2. I had also already tried the #pragma unroll 1 trick, which I saw somewhere on the forum.

lmem = 32

reg = 101

time = 0.41ms with 64 threads/block

  3. if( threadIdx.x < 0 ) __syncthreads();

This changed nothing! (But it does add warnings when I compile: nvcc nicely tells me that threadIdx.x is always positive and that the test is pointless.)

  4. I had also seen your trick using g_nTrue, and had already tried it ;)

lmem = 0

reg = 93

time = 0.38ms with 64 threads/block

So none of it improved my situation here, sadly. I tried maxrregcount in combination with the other methods, just in case it would help nvcc find other configurations, but it didn’t.

I’m beginning to think I can’t improve the register count. I’m stuck with 64 threads per block, that is 8% occupancy (64 out of the 768 resident threads a multiprocessor can hold). But using local memory is very slow: with maxrregcount 21 and 320 threads per block (42% occupancy), it’s almost 3 times slower than the 8% occupancy case without local memory… even though I found in some posts that local memory can sometimes be quite fast (http://forums.nvidia.com/index.php?showtopic=43259)