Too much local data: how to work around the local data size limitation?

I have code that evaluates a degree-15 polynomial in 3 variables, x, y, and z (but with no single exponent larger than 5); it has 216 terms.

I’m using Horner’s scheme for evaluation to avoid redundant computations, written out in full (no loops). The coefficients are in constant memory (the coeff array below). The expression looks like this:

float eval(float x, float y, float z)
{
    return x * (... very large subexpression here ...)
         + y * (... slightly smaller subexpression here ...)
         + z * (... much smaller subexpression here ...)
         + coeff[0];
}

This is what I get when I try to compile with CUDA 3.2 RC2 and Thrust 1.3, targeting sm_13, on a 64-bit Linux system:

This happens with other, Thrust-free, kernels too.

The funny thing is that if I use just the “x * (…)” part and comment the other terms out, the kernel compiles just fine. Commenting out the “x * (…)” term and leaving the rest also compiles. So instead of computing the first term and then moving on to the rest of the expression, the compiler seems to be interleaving the evaluation.

I already tried breaking the expression into individual functions, forcing them not to be inlined, and a couple of other things, but it seems impossible to keep the compiler from evaluating everything at once.

I’ve run out of ideas now, other than perhaps breaking the kernel into two, each evaluating half of the polynomial. Has anyone had this kind of problem before?

Edit: I forgot to add that without Horner’s scheme, with naive polynomial evaluation, the compiler starts swapping on my 6 GB system and takes more than 24 h (that’s how long I was willing to wait before killing it).

You’ve probably also tried assigning the three subexpressions to three variables. Declaring them volatile might help further.

Thanks for the volatile suggestion. Yes, I tried assigning the subexpressions to individual variables, but the compiler still produces the same result.

By making those temporary variables volatile I can see the local data usage decrease, but only very slowly; each further breakdown into more variables shaves off a little. I’m down to 0x4048 on the function with the smallest local data overflow, with 19 volatile temporaries, and I’m getting worried because each step reduces it at a decreasing rate.

I was doing something similar by breaking things into non-inline functions, but the overflow stopped decreasing at some point, so I gave up. I’m afraid I might hit a similar wall here.

Update: with 2 more temporaries I reached the same point as with the non-inline functions: the local data overflow starts to increase again.
