Weird register usage: bad allocation with repetitive code

Hey there,

I noticed something that I can’t really explain and I’d like your input on this.

The background first. I’ve been working on a finite element solver in which each element has 8 nodes. There are two passes, one over the elements and one over the nodes; we are interested here in the first pass. For each element, I have to compute some nodal quantities (one for each of the 8 nodes). To do that, I need quite a lot of data per element, and the data differ between elements of course. Therefore I have to make heaps of texture fetches, store the results in registers, do something with them, and write the result for each node to an array stored in global memory (so it can be re-used in the subsequent kernel). I’m explaining this so you won’t be surprised by the number of registers I use; it’s not like I have a choice. I was able to use shared memory as extra registers to some extent (since nothing needs to be shared between threads), but there isn’t enough shared memory to use this technique everywhere.
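To give an idea of the structure, here is a heavily simplified sketch of the element pass (the texture name, the data layout and the arithmetic are made up for illustration; this is not my actual code):

        texture<float4, 1, cudaReadModeElementType> texElemData;   // per-element data bound to a 1D texture

        __global__ void elementPass(float4 *F0 /*, ..., float4 *F7 */, int nbElements)
        {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;        // one thread per element
            if (tid >= nbElements) return;

            // heaps of texture fetches: the per-element data ends up in registers
            float4 d0 = tex1Dfetch(texElemData, 16 * tid + 0);
            float4 d1 = tex1Dfetch(texElemData, 16 * tid + 1);
            // ... many more fetches ...

            // do something with it, then write one result per node to global memory
            // so that the second (nodal) kernel can re-use it
            F0[tid] = make_float4(d0.x + d1.x, d0.y + d1.y, d0.z + d1.z, 0.0f);
            // ... F1[tid] to F7[tid] likewise ...
        }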

Anyway, after a lot of tweaks, the best memory usage I could obtain was this:

lmem = 0

smem = 468

reg = 88

Then I decided to gather the calculation for each node into a separate function and to call this function 8 times (instead of repeating the code in the kernel). I figured it might help the compiler save registers. So I have code that looks like this:

        F0[tid] = computeElementForce(0, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F1[tid] = computeElementForce(1, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F2[tid] = computeElementForce(2, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F3[tid] = computeElementForce(3, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F4[tid] = computeElementForce(4, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F5[tid] = computeElementForce(5, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F6[tid] = computeElementForce(6, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
        F7[tid] = computeElementForce(7, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                                      Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);

There are a lot of arguments, I know; most of them are float4. This uses heaps of registers, but re-doing all the texture fetches each time is way too costly, so that’s the best I can do. The arrays Fi (i from 0 to 7) are in global memory and hold my final results.

I was able to save 4 registers. It’s not nothing, but it’s not much. Now the funny thing: I tried commenting out some of the node calculations to see how the register count evolved. Here is what I found:

1 node 0/62

2 nodes 0/76

3 nodes 0/92

4 nodes 0/110

5 nodes 80/121

6 nodes 4/78

7 nodes 4/81

8 nodes 0/84

where x/y means x bytes of local memory and y registers. So with one call to this function, 62 registers are used in my kernel; with 2 calls, 76 registers, and so on.

Questions:

  • how do you explain this evolution in the number of registers (and local memory)?

  • why doesn’t the number of registers stay at 62? Shouldn’t the compiler re-use the registers between each call?

Predicting register usage is not trivial, since nvcc inlines function calls and performs aggressive code optimisation. It has been reported that unrolling loops sometimes increases register usage, which seems to be your case.

The compiler tries to reuse registers, but it also tries to optimise the code and avoid register bank conflicts etc., so this may be a reason for what you see.

There are 4 things worth trying:

  1. Use the -maxrregcount switch to limit the number of registers. Always check performance, since spilling data to local memory may considerably slow down your kernel.

  2. Place your calls in a loop with #pragma unroll 1. This will require minor changes to work with the Fi arrays, but will probably save some registers (see the sketch after this list).

  3. Use a construct like

if( threadIdx.x < 0 ) __syncthreads();

somewhere in your code (maybe between function calls). This trick has been posted on this forum and is known to reduce register usage.

  4. A similar trick which I’ve been using in my kernels: declare a constant variable, say g_nTrue, and enclose each of your function calls in a conditional statement:

if( g_nTrue ) F0[tid] = ...;

Then set g_nTrue to 1 (with cudaMemcpyToSymbol); this is also shown in the sketch below.

Thanks for your answer. I tried your 4 suggestions. For comparison, the best I had obtained so far was 0.32ms with the following memory usage:

lmem = 0

reg = 84

Note I’m using 64 threads per block.

Now the results for your suggestions.

  1. I had already tried --maxrregcount. Here are the results:

--maxrregcount 64

lmem = 324

reg = 64

time = 0.40ms with 64 threads/block

time = 0.40ms with 128 threads/block

--maxrregcount 42

lmem = 468

reg = 41

time = 0.74ms with 64 threads/block

time = 0.61ms with 192 threads/block

--maxrregcount 21

lmem = 324

reg = 23

time = 0.90ms with 64 threads/block

time = 0.92ms with 320 threads/block

  2. I had also already tried the #pragma unroll 1 trick, which I saw somewhere on the forum.

lmem = 32

reg = 101

time = 0.41ms with 64 threads/block

  3. if( threadIdx.x < 0 ) __syncthreads();

This changed nothing! (But it does add warnings when I compile: nvcc nicely tells me that threadIdx.x is always positive and that the test is pointless.)

  4. I had also seen your trick using g_nTrue, and had already tried it ;)

lmem = 0

reg = 93

time = 0.38ms with 64 threads/block

So none of it improved my situation here, sadly. I tried maxrregcount in combination with the other methods, just in case it would help nvcc find other configurations, but it didn’t.

I’m beginning to think I can’t improve the register count. I’m stuck with 64 threads per block, that is 8% occupancy (64 out of the 768 resident threads a multiprocessor can hold). But using local memory is very slow: with maxrregcount 21 and 320 threads per block (42% occupancy), it’s almost 3 times slower than the 8% occupancy case without local memory… even though I found in some posts that local memory can sometimes be quite fast (http://forums.nvidia.com/index.php?showtopic=43259)