speeding up calculation of natural exponentials..?

Hello,

I currently have an algorithm that must frequently calculate exponentials (exp())

I can reduce this to a single exp() calculation per warp, followed by one multiplication per active thread of the warp, instead of an exp() calculation per active thread
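Roughly what I have in mind, as a sketch only (it assumes the exp() argument really is uniform across the warp and that the full warp is active; warpSharedExp and per_thread_factor are names I made up, and __shfl_sync needs CUDA 9+):

// One lane computes the exp(); the others receive it via shuffle and
// only pay for their own multiply. Assumes x_uniform is the same for
// every lane of the warp, and that all 32 lanes are active.
__device__ float warpSharedExp(float x_uniform, float per_thread_factor)
{
    float e = 0.0f;
    if ((threadIdx.x & 31) == 0)           // lane 0 of the warp
        e = expf(x_uniform);               // the single exp() per warp
    e = __shfl_sync(0xffffffffu, e, 0);    // broadcast lane 0's result
    return e * per_thread_factor;          // cheap per-thread multiply
}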

To my knowledge, exp() is implemented in software as a function, not as a single hardware instruction, and is rather computationally expensive

Could one instead compute the exp()s in parallel (given that only one exp() is needed per warp, rather than one per thread), and would one see an improvement in calculation time?

Have you confirmed for certain that exp() is the bottleneck of your code?
How accurate must your results be? Are you using the fast-math flag when compiling?

The algorithm runs a large number of iterations, and each iteration repeatedly hits exp() calculations; hence I am concerned about the impact of those exp() calls on total calculation time

SP base-2 logs, exps, roots, etc. are already down to 6 clocks per instruction, so I can only guess what a natural exp() would cost (100 clocks on average…?)

I am currently contemplating converting the input SP or DP value to a 16-decimal-place number, breaking that up into 4 sub-values, using the 4 sub-values to index a pre-calculated exp() lookup table stored in constant memory, and multiplying / sum-scanning the 4 resulting values
The accuracy should be acceptable, and it should relieve pressure on the SP/DP units; I just need to verify that it would actually be faster/cheaper (see the sketch below)
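Something along these lines is what I have in mind (a rough sketch only; the table names, the 4-bit split, and the 0 <= x < 16 input range are my illustrative assumptions, with the tables filled from the host via cudaMemcpyToSymbol):

// Four 16-entry tables in constant memory, one per 4-bit 'digit' of the
// fixed-point input: since x = a+b+c+d, exp(x) = exp(a)*exp(b)*exp(c)*exp(d).
__constant__ float expTabA[16];   // exp(i)       for i = 0..15
__constant__ float expTabB[16];   // exp(i/16)    for i = 0..15
__constant__ float expTabC[16];   // exp(i/256)   for i = 0..15
__constant__ float expTabD[16];   // exp(i/4096)  for i = 0..15

__device__ float expLookup(float x)      // assumes 0 <= x < 16
{
    int q = (int)(x * 4096.0f);          // fixed point, ~1/4096 resolution
    return expTabA[(q >> 12) & 15]       // integer part
         * expTabB[(q >>  8) & 15]       // sixteenths
         * expTabC[(q >>  4) & 15]       // 1/256ths
         * expTabD[ q        & 15];      // 1/4096ths
}

One thing I will still have to check: constant memory only broadcasts at full speed when all threads of a warp read the same address, so divergent table indices would serialize the reads.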

Are you using -use_fast_math?

With that flag, expf(x) compiles to __expf(x) - a much faster, but slightly less accurate, variant (the fast-math intrinsics are single precision only, so double-precision exp() is unaffected). I would be surprised if it's still too slow, but you should really benchmark it first. Try the kernel but replace y = exp(z); with y = z+1; and see what the speed difference is.
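Something like this pair of kernels is what I mean (illustrative names; compile with and without -use_fast_math to see both effects):

__global__ void withExp(float *y, const float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = expf(z[i]);   // -use_fast_math turns this into __expf
}

__global__ void withoutExp(float *y, const float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = z[i] + 1.0f;  // same memory traffic, no exp()
}

If the two time out the same, the kernel is memory bound and the exp() costs you nothing.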

timer.start();
for (int i=0; i<100; i++) func_run(x.data,betas.data,length,2);      // x = exp(betas)
CUDA(cudaDeviceSynchronize());
MSG("Time for exp : %6f seconds, ",timer.stop());

timer.start();
for (int i=0; i<100; i++) scalop_run(x.data,1,betas.data,length,2);  // x = betas*1
CUDA(cudaDeviceSynchronize());
MSG("Time for multiplication : %6f seconds",timer.stop());

First bit calculates x=exp(betas) 100 times, no reallocation of memory
Second bit calculates x = betas*1 100 times, no reallocation

test.cpp(108):Time for exp : 0.0016045 seconds
test.cpp(113):Time for multiplication : 0.0018002 seconds

The two timings are essentially identical, which says these kernels are memory bound. Unless your code is compute bound rather than memory bound, don't waste your time optimizing the exp(). I'm still interested to see your code, though.

Does the calculation time of y = exp(x) not depend somewhat on the value of x…?

And a comparison over only 100 calculations is inconclusive, I would deem; I do not think the difference in calculation time will emerge with only 100 calculations

Still, I appreciate your feedback, and I do not rule out the possibility that you are right: that this optimization would amount to chasing the wind…

Without any exp(), the kernel I am working on took ±6000 seconds to complete 100 kernel runs, with millions upon millions of iterations per run (I should introduce a counter to count the exact number); now I want to introduce exp() into it, so I think there is valid reason to at least scrutinize the exp()

I think I can calculate an exp() with a lookup table in under 100 clocks; I consider 100 clocks for an ordinary exp() somewhat 'lenient'. What I like about the lookup approach is that it uses relatively cheap arithmetic (adds and multiplies), whereas the ordinary exp() calculation contains at least one division, I think (already 10 - 20 clocks or more) - so you do not have one warp hogging the arithmetic units and stalling everything else

I shall post the exp() calculation code when I am done, for you to evaluate and critique
I should also compare the calculation time with and without the new exp() code

I did the benchmark for you.

To take the exp of 800,000 individual elements each 100,000 times takes 0.1 seconds on a Titan.

To put it in perspective, the absolute FASTEST you could do a=a+b*c the same number of times is 0.01 seconds, so one exp() is equivalent to roughly 10 multiply-adds.

That means an exp() takes at most TEN clock cycles, not 100.
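Spelling the arithmetic out, taking the timings above at face value: 800,000 elements x 100,000 repetitions = 8e10 exp() calls in 0.1 s, i.e. 8e11 exp()/s. The same count of a=a+b*c in 0.01 s is 8e12 FMA/s - a factor of 10, hence roughly ten FMA-equivalent clock cycles per exp().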

Please keep in mind that when coding on a GPU, you typically create elementary functions (e.g. vector operations) and call them multiple times, instead of hardcoding the iteration inside the kernel like that. The overhead of calling a kernel is in the microseconds.
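For instance (a sketch with illustrative names, not your actual functions):

// One elementary vector op per kernel; iterate from the host instead of
// hardcoding the loop inside the kernel. A launch costs only microseconds.
__global__ void vecExp(float *y, const float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = expf(x[i]);
}

void runExp(float *d_y, const float *d_x, int n, int iters) {
    int block = 256, grid = (n + block - 1) / block;
    for (int it = 0; it < iters; ++it)       // host-side iteration
        vecExp<<<grid, block>>>(d_y, d_x, n);
}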

“Please keep in mind that when coding on a GPU, you typically create elementary functions (e.g. vector operations) and call them multiple times, instead of hardcoding the iteration inside the kernel like that. The overhead of calling a kernel is in the microseconds.”

I am running 30 - 75 instances of the same kernel concurrently, with each kernel itself implemented in parallel; I then also run each kernel for 100 iterations with different input values

“That means an exp() takes at most TEN clock cycles, not 100.”

This is from the C math library:

exp(x) = 2^k * exp(r), where x = k*ln2 + r, |r| <= 0.5*ln2

exp(r) = 1 + r + r*R1(r) / (2 - R1(r))   (for better accuracy)

where R1(r) = r - (P1*r^2 + P2*r^4 + ... + P5*r^10)

I have heard estimates for an exp calculation of about 100 clocks, or more
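To make the steps concrete, here is a hedged single-precision sketch of that scheme (my own illustrative code, not the library's; I keep only the first two polynomial terms, P1 = 1/6 and P2 = -1/360):

#include <math.h>

/* Sketch of the scheme above: reduce, approximate, scale back. */
__device__ float exp_sketch(float x)
{
    const float ln2 = 0.69314718f;
    /* Step 1: range reduction, x = k*ln2 + r with |r| <= 0.5*ln2 */
    int   k = (int)rintf(x / ln2);
    float r = x - (float)k * ln2;
    /* Step 2: R1(r) = r - (P1*r^2 + P2*r^4 + ...) */
    const float P1 = 1.6666667e-1f, P2 = -2.7777778e-3f;
    float r2 = r * r;
    float R1 = r - r2 * (P1 + r2 * P2);
    float er = 1.0f + r + (r * R1) / (2.0f - R1);  /* note the division */
    /* Step 3: scale back, exp(x) = 2^k * exp(r) */
    return ldexpf(er, k);
}

The division in step 2 is the one I was referring to earlier.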