CUDA and Exponentiation

I think I asked something about this long ago, but I thought it might be useful to ask now that CUDA Fortran is here.

Namely, what is the best way to handle exponentiation in CUDA Fortran? I’m currently trying to transform an old F77 code into CUDA Fortran, and so I’m encountering many of those tricks you see in hand-optimized code. For example, in my code, a variable is raised to the 58th power by:

yy=x*x             ! x**2
yy1=yy*yy          ! x**4
yy1=yy1*yy1        ! x**8
yy2=yy1*yy1        ! x**16
yy3=yy2*yy2        ! x**32
y=yy*yy1*yy2*yy3   ! x**(2+8+16+32) = x**58

which uses 8 multiplies instead of a pow operation and is probably a faster route than

y=x**58

when you are using a CPU. But I began to wonder whether, with CUDA, I’d be better off using the ** operator. First, the __powf call that might be generated (if fastmath maps ** onto it; I’m not sure) might execute in fewer cycles. Second, the reduced register pressure might help in a sprawling code like mine (I use maxregcount, so any extra registers spill to device memory).
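For what it’s worth, here is a minimal kernel sketch that would let both routes be timed side by side; the module and kernel names (pow_test, pow58) are made up, and a simple 1-D launch is assumed:

module pow_test
  use cudafor
  implicit none
contains
  attributes(global) subroutine pow58(x, y, n)
    real, dimension(*) :: x, y
    integer, value :: n
    integer :: i
    real :: yy, yy1, yy2, yy3
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      ! multiply route: x**58 = x**2 * x**8 * x**16 * x**32
      yy  = x(i)*x(i)
      yy1 = yy*yy
      yy1 = yy1*yy1
      yy2 = yy1*yy1
      yy3 = yy2*yy2
      y(i) = yy*yy1*yy2*yy3
      ! intrinsic route, for comparison (rebuild with and without -Mcuda=fastmath):
      ! y(i) = x(i)**58
    end if
  end subroutine pow58
end module pow_test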

I ask mainly because there are many hand-coded **8, **6, and even a **58 and a **21 in this program. True, the best answer is to try both out and benchmark, but at the moment this code won’t compile with -Mcuda due to too much local memory use (I’m working on cutting that down) and some weird issue with constant memory that might need a bug report (the constant struct has hundreds, even thousands, of entries…which isn’t right).

Matt

PS: Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is handled differently from the multiply method in the current code, leading to floating-point differences.

Hi Matt,

Sorry, I have not looked at the performance of pow on a GPU, so I don’t know which is better.

What I do know is that the compiler will generate code that uses the multiplication method for x**y where y <= 10. For values of y greater than 10, a ‘powf(x,y)’ call (or ‘pow’ if x is double precision) is generated. If using “-Mcuda=fastmath”, ‘powf’ is replaced with the faster but less accurate “__powf” function.
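If you want to stay on the multiply-only route for the larger exponents regardless of what the compiler picks, one option is an explicit square-and-multiply helper. The sketch below is only illustrative (the name ipow is made up, and it is not what the compiler emits for y <= 10, which is straight inline multiplies):

attributes(device) real function ipow(x, n)
  ! square-and-multiply: x**n in O(log n) multiplies, for n >= 0
  ! (would live in a module alongside the kernels that call it)
  real, intent(in) :: x
  integer, intent(in) :: n
  real :: base
  integer :: e
  ipow = 1.0
  base = x
  e = n
  do while (e > 0)
     if (mod(e, 2) == 1) ipow = ipow*base
     base = base*base
     e = e/2
  end do
end function ipow

Of course, for a fixed exponent like 58, hand-unrolling as in your existing code avoids the loop and branch entirely.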

Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is handled differently from the multiply method in the current code, leading to floating-point differences.

With **21 the compiler is calling the ‘powf’ routine, which is accurate to 5 ulps (not very accurate). I don’t know the accuracy of ‘__powf’, but I would assume it’s worse.
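A quick host-side check makes that kind of difference easy to see. The chain below (x**21 = x * x**4 * x**16) is just one possible decomposition and may not match the one in your code:

program check21
  implicit none
  real :: x, x2, x4, x8, x16, ychain, ypow
  ! read x at run time so the compiler cannot fold the expressions at compile time
  read *, x
  x2  = x*x          ! x**2
  x4  = x2*x2        ! x**4
  x8  = x4*x4        ! x**8
  x16 = x8*x8        ! x**16
  ychain = x*x4*x16  ! x**21 via multiplies
  ypow   = x**21     ! whatever the compiler picks for **21
  print *, 'chain :', ychain
  print *, 'x**21 :', ypow
  print *, 'diff  :', ychain - ypow
end program check21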

  • Mat

So, for x**8, even the GPU code will use the multiply method? Or is that only for CPU?

Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is handled differently from the multiply method in the current code, leading to floating-point differences.

With **21 the compiler is calling the ‘powf’ routine, which is accurate to 5 ulps (not very accurate). I don’t know the accuracy of ‘__powf’, but I would assume it’s worse.

Huh. I guess I’m surprised then that 58 is the same either way (that is, x**58 gives the same answer as the multiply route on a CPU). Perhaps, though, powf isn’t actually being used there, or, with such a large exponent, the number is butting up against the single-precision limits near 0 or 1? Hmm.

Of course, all this will be answered if and when I can get the GPU code to compile. Though I’m guessing many a forum post and support email will occur before then!

Thanks,
Matt

Hi Matt,

So, for x**8, even the GPU code will use the multiply method? Or is that only for CPU?

Both actually.

Of course, all this will be answered if and when I can get the GPU code to compile. Though I’m guessing many a forum post and support email will occur before then!

Thankfully we don’t charge on a per-question basis, so keep ’em coming!

  • Mat