I don’t understand why nvcc uses so many registers for my kernel. Here is the code I have:

```
float XT[3][3];
float4 Dh;
float4 NodeDisp;
/**
* First contribution
*/
/// Grab some values from textures
Dh = tex2D(DhCX_ref, texX, texY);
indY = (int)floor( __fdividef((float)ElNodes.x,MAXLENGTH) );
indX = ElNodes.x - indY*MAXLENGTH;
NodeDisp = tex2D(Disp_ref, indX, indY);
/// Computations
XT[0][0] = Dh.x*NodeDisp.x;
XT[1][0] = Dh.y*NodeDisp.x;
XT[2][0] = Dh.z*NodeDisp.x;
XT[0][1] = Dh.x*NodeDisp.y;
XT[1][1] = Dh.y*NodeDisp.y;
XT[2][1] = Dh.z*NodeDisp.y;
XT[0][2] = Dh.x*NodeDisp.z;
XT[1][2] = Dh.y*NodeDisp.z;
XT[2][2] = Dh.z*NodeDisp.z;
/**
* Second contribution
*/
// Grab other values re-using the same temporary float4
Dh = tex2D(DhCY_ref, texX, texY);
indY = (int)floor( __fdividef((float)ElNodes.y,MAXLENGTH) );
indX = ElNodes.y - indY*MAXLENGTH;
NodeDisp = tex2D(Disp_ref, indX, indY);
/// Computations
XT[0][0] += Dh.x*NodeDisp.x;
XT[1][0] += Dh.y*NodeDisp.x;
XT[2][0] += Dh.z*NodeDisp.x;
XT[0][1] += Dh.x*NodeDisp.y;
XT[1][1] += Dh.y*NodeDisp.y;
XT[2][1] += Dh.z*NodeDisp.y;
XT[0][2] += Dh.x*NodeDisp.z;
XT[1][2] += Dh.y*NodeDisp.z;
XT[2][2] += Dh.z*NodeDisp.z;
...
```

I have four contributions like this. With only the first one, the kernel uses 12 registers. Adding the second contribution brings it to 16 registers, the third to 24, and with the fourth the kernel uses 32 registers. Why isn’t the compiler re-using the same registers for the computations? How can I reduce the number of registers used? It doesn’t make sense to me, but I’m not an expert in low-level (assembly-like) languages.
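
In case it helps to reproduce the numbers: per-kernel register counts can be printed by ptxas in verbose mode. A minimal sketch of the build command, assuming the kernel lives in a file named `kernel.cu` (a hypothetical name, not my actual file):

```
# kernel.cu is a placeholder file name; -v makes ptxas report the registers used per kernel
nvcc --ptxas-options=-v -c kernel.cu
```

Commenting out contributions one at a time and rebuilding this way should show how the count changes with each one.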