Now, let’s start with the (1000000)^2=10^12 case.
Actually, we can do it with Numba.
But first, let’s generate some reference points with Pari for N = 1000, 10000, 100000:
? parsum(x=1,1000,parsum(y=1,1000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 1,028 ms, real time = 75 ms.
%6 = -0.29452031608989259865270968298593082005
?
? parsum(x=1,10000,parsum(y=1,10000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 1min, 45,662 ms, real time = 7,302 ms.
%7 = -0.29452023400558052987449790196684298567
?
?
?
? parsum(x=1,100000,parsum(y=1,100000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 3h, 26min, 49,217 ms, real time = 13min, 4,862 ms.
%8 = -0.29452023318099672363387284014568673756
?
The last case, N = 100000, took 13 minutes and around 5 seconds of real time.
With Numba:
a=_time();zt2(1000);_time()-a
-0.29452031608989515
0.0
a=_time();zt2(10000);_time()-a
-0.29452023400565625
0.0
a=_time();zt2(100000);_time()-a
-0.2945202332482132
1.1132497787475586
and here it is
a=_time();zt2(1000000);_time()-a
-0.2945202331725871
112.3285620212555
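The zt2 kernel itself is not shown above; here is a minimal sketch of what such a Numba kernel might look like, reconstructed from the summand in the Pari calls (the name zt2 matches the timings above, but the body and the parallel layout are my guesses; a plain-Python fallback is included for when Numba is not installed):

```python
try:
    from numba import njit, prange        # JIT-compile and parallelize
except ImportError:                       # fallback: run as plain Python
    njit = lambda **kw: (lambda f: f)
    prange = range

@njit(parallel=True, fastmath=True)
def zt2(N):
    """Sum (x^4 - 6 x^2 y^2 + y^4) / (x^2 + y^2)^4 over 1 <= x, y <= N."""
    s = 0.0
    for x in prange(1, N + 1):            # outer loop split across cores
        for y in range(1, N + 1):
            x2 = float(x * x)
            y2 = float(y * y)
            s += (x2 * x2 - 6.0 * x2 * y2 + y2 * y2) / (x2 + y2) ** 4
    return s
```

With parallel=True the outer loop is distributed over cores; note that the first call pays the JIT compilation cost, which is consistent with the 0.0-second readings at small N above.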
Not bad: less than 2 minutes, just 112 seconds. But we shall do much better once the symmetric-group symmetry and the hidden Diophantine relation (promoted) are taken into account, as the Golden Rule says:
— Don’t Calculate Something Twice —
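One immediate application of the rule, as a sketch (function and variable names are mine): the summand is symmetric under x ↔ y, so it suffices to sum once over x < y, double it, and add the diagonal x = y, where the term simplifies to (x^4 - 6x^4 + x^4)/(2x^2)^4 = -1/(4x^4). This roughly halves the work:

```python
def zt2_sym(N):
    """Same lattice sum, using the x <-> y symmetry:
    total = 2 * (sum over x < y) + (diagonal x == y)."""
    s = 0.0
    for x in range(1, N + 1):
        x2 = float(x * x)
        for y in range(x + 1, N + 1):          # strictly upper triangle
            y2 = float(y * y)
            s += (x2 * x2 - 6.0 * x2 * y2 + y2 * y2) / (x2 + y2) ** 4
    s *= 2.0                                   # lower triangle by symmetry
    for x in range(1, N + 1):                  # diagonal: -1 / (4 x^4)
        s += -0.25 / float(x) ** 4
    return s
```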
Now, let’s try OpenACC.
*** PLEASE, IF YOU HAVE A BETTER SUGGESTION, TEACH ME ***
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

int main(int argc, char *argv[])
{
    double zeta = 0, kk1, kk2;
    long int i, j, N;
    if (argc <= 1)
    {
        printf("Usage: %s <number>\n", argv[0]);
        return 2;
    }
    N = atol(argv[1]);                 /* atol, not atoi: N is a long int */
    /*acc_init( acc_device_nvidia );*/
    #pragma acc kernels
    #pragma acc loop
    for (i = 1; i <= N; i++)
    {
        #pragma acc loop
        for (j = 1; j <= N; j++)
        {
            kk1 = (double)(i*i);       /* C cast syntax, not double(i*i) */
            kk2 = (double)(j*j);
            zeta += ((kk1-kk2)*(kk1-kk2) - 4*kk1*kk2)
                  / ((kk1+kk2)*(kk1+kk2)*(kk1+kk2)*(kk1+kk2));
        }
    }
    printf("my zeta %ld = %.25f \n", N, zeta);   /* %ld for long int */
    return 0;
}
I know that for the GPU this is usually not optimal. Do I need to define gang, worker, and vector sizes for every different range?!
nvc++ -acc=multicore -Minfo=accel -I include/ zetacacc.c -O4 -ozetacacc
I also tried -fast
It is OK, very comparable with Numba, and the most important point:
I now have something to compare against Numba for higher values, where Pari
takes too long to compute (at least on a laptop).
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 1000
my zeta 1000 = -0.2945203160898906546982801
real 0m0.006s
user 0m0.000s
sys 0m0.021s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 10000
my zeta 10000 = -0.2945202340061441326213298
real 0m0.025s
user 0m0.313s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 100000
my zeta 100000 = -0.2945202337184757990229400
real 0m1.572s
user 0m25.074s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 1000000
my zeta 1000000 = -0.2945202331726640698761344
real 2m43.937s
user 43m41.286s
sys 0m1.380s
mabd@LAPTOP-T8DQ9UK0:~/zeta$
We need to do better because the true cases, the higher dimensional ones,
are numerically very expensive.
Compiling for the GPU:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ nvc++ -acc -Minfo=accel -cuda -I include/ zetacacc.c -O4 -ozetacaccg
main:
28, Loop is parallelizable
Generating implicit copy(zeta) [if not already present]
31, Loop is parallelizable
Generating Tesla code
28, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating implicit reduction(+:zeta)
31, /* blockIdx.x threadIdx.x auto-collapsed */
mabd@LAPTOP-T8DQ9UK0:~/zeta$
I tried to change from Tesla to cc75, but it failed:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ nvc++ -acc -gpu=cc75 -Minfo=accel -cuda -I include/ zetacacc.c -fast -ozetacaccg
nvc++-Error-CUDA version 10.2 is not available in this installation.
nvc++-Error-CUDA version 10.2 is not available in this installation.
Anyway, I am using HPC SDK 21.3:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ /opt/nvidia/hpc_sdk/Linux_x86_64/21.3/
The GPU case is slower, so I shall write a PyCUDA raw kernel.
I don’t have any profiler here; nvprof is not working in WSL.
Please guys, if you have even a pre-release version (alpha, beta, anything),
it is OK for me, I can test it for you.
Here are the results on the GPU:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 1000
my zeta 1000 = -0.2945203160898925975885732
real 0m0.496s
user 0m0.022s
sys 0m0.098s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 10000
my zeta 10000 = -0.2945202340055809719920887
real 0m0.538s
user 0m0.040s
sys 0m0.119s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 100000
my zeta 100000 = -0.2945202331809971263432146
real 0m3.054s
user 0m2.532s
sys 0m0.101s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 1000000
my zeta 1000000 = -0.2945202331727466704691665
real 4m14.078s
user 4m13.702s
sys 0m0.090s
mabd@LAPTOP-T8DQ9UK0:~/zeta$
It is clear that there is some extra overhead involved. But we have generated new results with
Numba, OpenACC (multicore), and OpenACC (GPU).
Studying the symmetry involved in multi-loops will be extremely useful.
It is strange that it has not been done; it could be implemented in
compilers.
Here in the 2d case it is easy TO SEE. The higher-dimensional
case involves some imagination.
Luckily for me, I worked with GR and Kaluza-Klein before doing
my thesis on higher-d (>2d) integrable models. The generic case involves higher-rank
tensors, but the symmetric group action can be easily calculated. Getting the additional
part is problem-dependent, not generic.
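To make the symmetric-group remark concrete, a generic sketch (all names are mine): for any summand that is symmetric under permutations of its d arguments, it suffices to visit each non-decreasing tuple once and weight it by the size of its S_d orbit, which is a multinomial coefficient. The problem-dependent part, as said above, is whatever extra structure the specific summand has beyond this.

```python
import itertools
import math

def sym_sum(f, N, d):
    """Sum f(x_1, ..., x_d) over {1..N}^d for a permutation-symmetric f:
    visit each sorted tuple once, weighted by its S_d orbit size
    d! / prod_v (multiplicity of v in the tuple)!."""
    total = 0.0
    for t in itertools.combinations_with_replacement(range(1, N + 1), d):
        w = math.factorial(d)            # orbit size starts at d!
        for v in set(t):
            w //= math.factorial(t.count(v))
        total += w * f(*t)
    return total
```

For d = 2 this reduces to the familiar "upper triangle plus diagonal" trick; for larger d the saving approaches a factor of d!.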