Weird use of registers: too many registers are wasted

I don’t understand why nvcc uses so many registers for my kernel. Here is the code I got:

    float XT[3][3];
    float4 Dh;
    float4 NodeDisp;

    /**
     * First contribution
     */
    /// Grab some values from textures
    Dh = tex2D(DhCX_ref, texX, texY);
    indY = (int)floor( __fdividef((float)ElNodes.x, MAXLENGTH) );
    indX = ElNodes.x - indY*MAXLENGTH;
    NodeDisp = tex2D(Disp_ref, indX, indY);

    /// Computations
    XT[0][0] = Dh.x*NodeDisp.x;
    XT[1][0] = Dh.y*NodeDisp.x;
    XT[2][0] = Dh.z*NodeDisp.x;
    XT[0][1] = Dh.x*NodeDisp.y;
    XT[1][1] = Dh.y*NodeDisp.y;
    XT[2][1] = Dh.z*NodeDisp.y;
    XT[0][2] = Dh.x*NodeDisp.z;
    XT[1][2] = Dh.y*NodeDisp.z;
    XT[2][2] = Dh.z*NodeDisp.z;

    /**
     * Second contribution
     */
    /// Grab other values, re-using the same temporary float4
    Dh = tex2D(DhCY_ref, texX, texY);
    indY = (int)floor( __fdividef((float)ElNodes.y, MAXLENGTH) );
    indX = ElNodes.y - indY*MAXLENGTH;
    NodeDisp = tex2D(Disp_ref, indX, indY);

    /// Computations
    XT[0][0] += Dh.x*NodeDisp.x;
    XT[1][0] += Dh.y*NodeDisp.x;
    XT[2][0] += Dh.z*NodeDisp.x;
    XT[0][1] += Dh.x*NodeDisp.y;
    XT[1][1] += Dh.y*NodeDisp.y;
    XT[2][1] += Dh.z*NodeDisp.y;
    XT[0][2] += Dh.x*NodeDisp.z;
    XT[1][2] += Dh.y*NodeDisp.z;
    XT[2][2] += Dh.z*NodeDisp.z;

    ...

I have 4 contributions like this. With only the first one, the kernel uses 12 registers. When I add the second contribution it reaches 16 registers, with the third 24 registers, and with the fourth the kernel uses 32 registers… Why isn’t it re-using the same registers for the computations? How can I reduce the number of registers used? It really doesn’t make sense to me, but I’m not an expert in low-level (assembly-like) languages.

What does the code around them look like?
Maybe you use something like the texture coordinates later, and the compiler stupidly applied common subexpression elimination.
Most registers in my kernels are wasted this way; the only way I found to stop it is to store would-be common subexpressions in volatile shared memory.

I had such a problem. Mine was solved when I used a loop; I noticed that the register count grows when the loop is unrolled.

Try turning this code:

    XT[0][0] = Dh.x*NodeDisp.x;
    XT[1][0] = Dh.y*NodeDisp.x;
    XT[2][0] = Dh.z*NodeDisp.x;
    XT[0][1] = Dh.x*NodeDisp.y;
    XT[1][1] = Dh.y*NodeDisp.y;
    XT[2][1] = Dh.z*NodeDisp.y;
    XT[0][2] = Dh.x*NodeDisp.z;
    XT[1][2] = Dh.y*NodeDisp.z;
    XT[2][2] = Dh.z*NodeDisp.z;

into a loop like this:

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
        {
            // compute Value for this entry
            XT[i][j] = Value;
        }

In my case the number of registers was reduced by a factor of 3.
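For the code in question, a concrete version of that loop for the first contribution might look like the sketch below. This is only a sketch: it assumes the float4 components can be walked through a pointer cast (dh, nd and the cast are not in the original posts), and indexing local arrays with loop variables like this can push data into local memory, which matches the lmem increase the original poster reports further down.

    float *dh = (float *)&Dh;          // Dh.x, Dh.y, Dh.z, Dh.w as dh[0..3]
    float *nd = (float *)&NodeDisp;    // NodeDisp.x, NodeDisp.y, NodeDisp.z, NodeDisp.w as nd[0..3]

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            XT[i][j] = dh[i] * nd[j];  // same products as the unrolled assignments above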

What do you mean? Whether I use texX and texY later in my code? Yes, of course I do. Would that be my problem? I’m sorry, but what do you call common subexpressions, and how can I use volatile memory? Can you explain this to me, please?

Because I’ve already tried something like this:

    float temp;

    temp = Dh.x*NodeDisp.x + 1.0f;
    XT[0][0] = temp;
    temp = Dh.y*NodeDisp.x;
    XT[1][0] = temp;
    temp = Dh.z*NodeDisp.x;
    XT[2][0] = temp;
    temp = Dh.x*NodeDisp.y;
    XT[0][1] = temp;
    temp = Dh.y*NodeDisp.y + 1.0f;
    XT[1][1] = temp;
    temp = Dh.z*NodeDisp.y;
    XT[2][1] = temp;
    temp = Dh.x*NodeDisp.z;
    XT[0][2] = temp;
    temp = Dh.y*NodeDisp.z;
    XT[1][2] = temp;
    temp = Dh.z*NodeDisp.z + 1.0f;
    XT[2][2] = temp;

But nothing changed. And with ‘volatile float temp;’ I save 1 register on the first contribution, but in the end it uses 36 registers for the 4 contributions (so it’s even worse). But maybe that’s not what you meant.

I’ve just tried this. I had to use 3 switches to choose between my different textures, so it’s not very efficient, but I could reduce the number of registers from 32 to 23. That’s something, but it’s still using more registers than it should. And at the end of my kernel (this part is not the actual end) my gain is only 4 registers (36 to 32).

I reckon that with a lot of different textures and computations it’s not going to be easy for me to reduce the register usage…

Edit: actually I now get lmem=36 instead of 0… So it’s not really a good plan to do this…

Well, a common subexpression is what the compiler optimization stores in a register so that the expression’s value doesn’t have to be computed twice.

If something in your program is computed and reusable across contributions, or something you compute during one contribution may be used later, the compiler may use up a register to keep it around.

To use volatile memory, do something like this:

    int b;
    // blah blah blah

    extern __shared__ int shmem[];
    float volatile *ff = (float *)shmem + threadIdx.x;
    *ff = (float)b;    // an expression you don't want the compiler to "optimize"

    // blah blah blah, using *ff in place of (float)b

In your case, the common subexpression could be the result of (float)ElNodes.x and the like. I can’t be quite sure about this, though; it’s better to read the PTX and find out what the compiler is actually doing.

You can also use the --maxrregcount command line option to nvcc. This sets a hard limit on the number of registers that nvcc will use for your kernel (rounded up to the nearest multiple of four). Note that you should experiment to find which number gives the best performance, as with very low register counts your code may need lots of spilling to local memory. Also, if you pick too low a number, nvcc will never finish compiling.
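For reference, a minimal example of how the flag is passed (the file name mykernel.cu and the limit of 20 are just placeholders; --ptxas-options=-v is a handy companion for checking the resulting register and lmem usage):

    nvcc --maxrregcount=20 -c mykernel.cu        # cap the per-thread register count
    nvcc --ptxas-options=-v -c mykernel.cu       # ptxas reports registers / lmem / smem per kernel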

Without being able to see the code produced, my guess is that the reason you’re seeing such high register usage is that your code has three sets of two texture fetches that happen to be independent of each other. I’ll hazard a guess that the ideal way to compile these is to ‘hoist’ the texture fetches as early as possible (as they take hundreds of cycles), so that your code ends up looking like:

    /// Grab some values from textures
    temp1_Dh = tex2D(DhCX_ref, texX, texY);
    ... // calculate indX, indY
    temp1_NodeDisp = tex2D(Disp_ref, indX, indY);

    temp2_Dh = tex2D(DhCY_ref, texX, texY);
    ... // calculate indX, indY
    temp2_NodeDisp = tex2D(Disp_ref, indX, indY);

    temp3_Dh = tex2D(DhCY_ref, texX, texY);
    ... // calculate indX, indY
    temp3_NodeDisp = tex2D(Disp_ref, indX, indY);

    ... // NOW do all of those calculations of XT

To a compiler, your attempts to ‘re-use’ Dh and NodeDisp are meaningless. Most compilers don’t look so much at single variables as at live ranges of a variable, that is, the span from a definition to the last use of that definition. So a compiler can see quite effortlessly that there is no relation between the first, second and third definitions of Dh and NodeDisp.
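A minimal sketch of the idea (hypothetical code built from the identifiers in the original post; the variable a and the sums s1 and s2 are made up): even though the source re-uses a single variable, the compiler sees two independent definitions, each with its own live range, and is free to keep them in two separate registers so that the second fetch can be issued before the first value is consumed:

    float4 a;
    a = tex2D(DhCX_ref, texX, texY);    // definition 1: live from here...
    float s1 = a.x + a.y + a.z;         // ...to here (last use of definition 1)
    a = tex2D(DhCY_ref, texX, texY);    // definition 2: a completely separate live range
    float s2 = a.x + a.y + a.z;         // the compiler may hoist the second fetch above s1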

Before you try to bludgeon the compiler into not doing this (with various loop tricks, for example), there is a good reason why the compiler tries to hoist these loads as high as possible. All of them have very high latency, so doing 6 of them in a row lets you ‘pipeline’ the texture loads together: you pay roughly the latency of the first load, not of all three pairs. Break that and you could lose nearly a factor of three in performance.

Getting register count down has become a bit of an obsession around here (I think the occupancy spreadsheet is too tempting a venue for optimization, and much easier than measuring real performance), but having high occupancy doesn’t necessarily imply high performance. If you successfully manage to break the above optimization, you might be able to fit more threads onto each multiprocessor, but you might find that the net result is much slower.

Geoff.

Thank you to all three of you for your answers.

Geoff> Thank you for your complete explanation, it’s going to be helpful. Indeed, my first attempt with this code was to do all the texture fetches first and then all the computations. By trying to ‘re-use’ Dh and NodeDisp (which is meaningless to the compiler, I understand now) I saved 4 registers. I was fully aware that saving only 4 registers is not much use, especially when I’m using about 36 registers in total; I would need to halve that number to hope for improved performance.

I’m starting to think it’s not going to be possible to reduce the register count anyway. And since I have something like 9 or 10 texture fetches in my kernel, given what you said, reducing the number of registers would be pointless performance-wise. So I’m not going to waste energy on this fight anymore…

Nathan> I’ve already tried --maxrregcount. First, as you said, if the number is too low it doesn’t compile anymore. And second, to save registers nvcc uses local memory instead. Even when I use this option to save, say, only 4 registers, I see lower performance because of it. So it’s not a good option for me.

asadafag> OK, it’s clearer now. I might try this. I’m not sure it’s going to give me much of a speed improvement after what Geoff said; I think my kernel is definitely bound by memory latency. But I’ll try it, I think, at least to be sure.