I have code that evaluates a degree-15 polynomial in 3 variables, x, y and z (with no exponent larger than 5 in any single variable); it has 6 × 6 × 6 = 216 terms.
I’m using Horner’s scheme for the evaluation to avoid redundant computations, with the whole expression written out by hand (no loops). The coefficients are in constant memory (the coeff array below). The expression looks like this:
float eval(float x, float y, float z)
{
    return x * (... very large sub expr here...)
         + y * (... slightly smaller sub expr here...)
         + z * (... much smaller sub expr here...)
         + coeff[0];
}
This is what I get when I try to compile with CUDA 3.2 RC2 with Thrust 1.3, for sm_13 target, on a 64-bit Linux system:
This happens with other, Thrust-free, kernels too.
The funny thing is that if I keep just the “x * (…)” part and comment out the other terms, the kernel compiles just fine. Commenting out the “x * (…)” term and leaving the rest also compiles. So instead of computing the first term and then working through the rest of the expression, the compiler seems to be interleaving the evaluation of all the terms.
I have already tried breaking the expression down into individual functions, forcing them not to be inlined, and a few other things, but it seems impossible to keep the compiler from evaluating everything at once.
I’ve run out of ideas, other than perhaps splitting the kernel into two, each evaluating half of the polynomial. Has anyone had this kind of problem before?
Edit: I forgot to add that without Horner’s scheme, using a naive polynomial evaluation, the compiler starts swapping on my 6 GB system and takes more than 24 hours (that’s how long I was willing to wait before killing it).
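For reference, the difference between the two evaluation strategies can be sketched with loops in plain C++. This is only an illustration: the real code is fully unrolled and structured differently, and the flat coefficient layout and the names eval_horner_loops and eval_naive_loops are assumptions, not the actual code. The nested Horner form needs one multiply-add per coefficient plus a few more per level, while the naive form recomputes every power for each of the 216 terms.

```cpp
#include <cmath>

constexpr int kN = 6;  // exponents 0..5 per variable, 6*6*6 = 216 terms

// Assumed (hypothetical) layout: c[(i*6 + j)*6 + k] multiplies x^i * y^j * z^k.

// Nested Horner: innermost in z, then y, then x.
float eval_horner_loops(const float* c, float x, float y, float z)
{
    float px = 0.0f;
    for (int i = kN - 1; i >= 0; --i) {
        float py = 0.0f;
        for (int j = kN - 1; j >= 0; --j) {
            float pz = 0.0f;
            for (int k = kN - 1; k >= 0; --k)
                pz = pz * z + c[(i * kN + j) * kN + k];
            py = py * y + pz;
        }
        px = px * x + py;
    }
    return px;
}

// Naive evaluation: every term computes its own powers from scratch.
float eval_naive_loops(const float* c, float x, float y, float z)
{
    float sum = 0.0f;
    for (int i = 0; i < kN; ++i)
        for (int j = 0; j < kN; ++j)
            for (int k = 0; k < kN; ++k)
                sum += c[(i * kN + j) * kN + k]
                     * static_cast<float>(std::pow(x, i) * std::pow(y, j) * std::pow(z, k));
    return sum;
}
```

Both functions compute the same polynomial up to floating-point rounding, so the naive one is handy as a correctness check for any hand-unrolled Horner expression.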