Hey there,
I noticed something that I can’t really explain and I’d like your input on this.
The background first. I've been working on a finite element solver in which each element has 8 nodes. The solver makes 2 passes, one over the elements and one over the nodes; we are only interested in the first pass here. For each element, I have to compute some nodal quantities (so one set for each of the 8 nodes). To do that, I need quite a lot of data per element, and the data differ between elements of course. Therefore I have to make heaps of texture fetches, store the results in registers, do something with them, and write the result for each node to an array stored in global memory (so it can be re-used in the subsequent kernel). I'm explaining this because I don't want you to be surprised by the number of registers I use; it's not like I have a choice. I was able to use shared memory as extra per-thread storage instead of registers to some extent (since nothing needs to be shared between threads), but there isn't enough shared memory to use this technique everywhere.
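To make it concrete, here is a minimal sketch of what I mean by "using shared memory as registers" (the names and sizes below are made up for the example, not my real code): each thread gets its own slot in the shared array and uses it purely as private scratch space, which relieves some pressure on the register file.

#define BLOCK_SIZE 64          // threads per block (example value)
#define SCRATCH_PER_THREAD 4   // extra "registers" per thread (example value)

__global__ void firstPassKernel(/* element data, output arrays, ... */)
{
    // Shared memory used as per-thread scratch only, never for
    // communication between threads: each thread owns one column.
    __shared__ float scratch[SCRATCH_PER_THREAD][BLOCK_SIZE];

    scratch[0][threadIdx.x] = 0.0f;   // behaves like an extra register,
    scratch[1][threadIdx.x] = 0.0f;   // just slower to access
    // ... use scratch[i][threadIdx.x] for intermediate values ...
}

Putting threadIdx.x in the last index keeps neighbouring threads in different shared memory banks, so these accesses don't create bank conflicts.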
Anyway, after a lot of tweaking, the best resource usage I could obtain was this:
lmem = 0
smem = 468
reg = 88
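(For reference, these are the lmem/smem/reg figures the toolchain reports for the kernel; you can get the same information by compiling with verbose ptxas output, something like:

nvcc --ptxas-options=-v -c myKernel.cu

where myKernel.cu is just a placeholder name.)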
Then I decided to gather the calculation for each node into a separate function and call this function 8 times (instead of repeating the code in the kernel). I figured it might help the compiler save registers. So now I have code which looks like this:
F0[tid] = computeElementForce(0, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F1[tid] = computeElementForce(1, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F2[tid] = computeElementForce(2, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F3[tid] = computeElementForce(3, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F4[tid] = computeElementForce(4, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F5[tid] = computeElementForce(5, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F6[tid] = computeElementForce(6, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
F7[tid] = computeElementForce(7, Dh0_a, Dh0_b, Dh1_a, Dh1_b, Dh2_a, Dh2_b, Node1Disp, Node2Disp, Node3Disp, Node4Disp,
                              Node5Disp, Node6Disp, Node7Disp, Node8Disp, SPK, tid);
There are a lot of arguments, I know; most of them are float4. This uses heaps of registers, but re-doing all the texture fetches inside each call would be way too costly, so that's the best I can do. The arrays Fi (i from 0 to 7) are in global memory and hold my final results.
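For reference, the helper's signature is along these lines (body omitted; I'm writing float4 for every argument and for the return type just for the sake of the sketch, the exact types don't matter for the question):

__device__ float4 computeElementForce(int node,
                                      float4 Dh0_a, float4 Dh0_b,
                                      float4 Dh1_a, float4 Dh1_b,
                                      float4 Dh2_a, float4 Dh2_b,
                                      float4 Node1Disp, float4 Node2Disp,
                                      float4 Node3Disp, float4 Node4Disp,
                                      float4 Node5Disp, float4 Node6Disp,
                                      float4 Node7Disp, float4 Node8Disp,
                                      float4 SPK, int tid)
{
    // Combines the Dh* data, the eight nodal displacements and SPK into
    // the force contribution of the requested node.
    float4 force = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    // ... actual computation omitted ...
    return force;
}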
This saved me 4 registers. It's not nothing, but it's not much either. Now for the funny part: I tried commenting out some of the node calculations to see how the register count evolved. Here is what I found:
1 node 0/62
2 nodes 0/76
3 nodes 0/92
4 nodes 0/110
5 nodes 80/121
6 nodes 4/78
7 nodes 4/81
8 nodes 0/84
where x/y means x bytes of local memory and y registers. So with one call to this function, 62 registers are used in my kernel; with 2 calls, 76 registers, etc.
Questions:
- How do you explain this evolution in the number of registers (and local memory)?
- Why doesn't the number of registers stay at 62? Shouldn't the compiler re-use the registers between each call?