Believe It or Not: SuperCUDA is 10^6(+) times faster than Mathematica

One year ago, I knew almost no CUDA, although I have been programming in C since 1990 and
in Mathematica since 1994, plus many different languages over PC,
AVR, PIC, and ARM. But I had never imagined that I would beat Mathematica
by six orders of magnitude.
Mathematica is 10-20 times faster than Maple. Single-threaded Pari is faster still,
and exact. Multithreaded Pari under WSL now rocks, 7 times faster than
single-threaded Pari (on a Ryzen 7 4800H).
But Numba (prange, njit) and multithreaded C were the fastest,
though only to 10^-9 precision.
Now SUPERCUDA is the new queen: not just faster, but accurate to
10^-15. You have done a really nice job.

I shall show you my zeta function calculations: hundreds of millions of terms in milliseconds (Taylor expansion). Now I am rewriting them in terms of
products, i.e. theta functions; they should have unbelievably better convergence
properties. Numerical analysis needs to be rewritten, and new methods
should be invented, not the traditional way.

But you need to simplify CUDA initialization, allocation, and garbage collection;
some form of memory pool is also needed.
You also need to write a better manual: Appendices E & M of the C Programming Guide
deserve manuals of their own. Also, changing profilers, from PGPROF and the Visual Profiler to Nsight, is confusing. Moreover, you should work harder on Windows;
in the third world, Windows is more popular.
You have compiler problems under Windows? Use clang/LLVM.
But move away from pthreads and the POSIX stuff. It has never worked well under Windows.

Around 1797, Gauss calculated the lemniscate zeta and changed history. He published nothing except a cryptic sentence at the start of the famous section seven of his eternal
Disquisitiones Arithmeticae. It took humanity around thirty years, and none other
than the great Abel (haven't you heard of the Abel Prize?), who was, as he described himself, "pauvre comme un rat d'église", wrote the legendary
Recherches sur les fonctions elliptiques and changed the history of mathematics. Elliptic functions
were officially born, and then, with Jacobi's Fundamenta Nova, we entered a new chapter of mathematics. Complex analysis was born out of trying to understand elliptic functions,
not the other way around; nowadays elliptic functions appear rather shyly as
the last chapter, if at all.

I will show you how to calculate the lemniscate zeta (over the Gaussian integers), then
how to calculate the integral quaternionic zeta, using SUPERCUDA.
Wow, when I started this kind of calculation twenty years ago on a 650 MHz
IBM ThinkPad, all I could calculate was a few thousand Taylor series terms. Now, thanks
to "evolution", CUDA, Numba, and OpenACC help me calculate trillions of terms of these zetas
and even their thetas.

I shall prepare and show you the code, and we will time it.

I don't know how to write equations in this forum; it seems that LaTeX is not installed.
Anyway, we shall use Mathematica; I find version 9 with the Parallel kit to be the fastest.
In recent versions, only 8, not 16, kernels can be launched on my Ryzen 7 4800H (with
a 1660 Ti overclocked to an 1850 MHz boost, 6 GB GDDR6). So we shall use Mathematica 9, which is definitely twice as fast as 12.3. We shall also need Pari: the single-threaded version
can be downloaded for Windows from their site. Then you should recompile your own multithreaded version under WSL. The multithreaded version is 7x faster
than the single-threaded one. Recompiling it under Windows with the Cygwin shell doesn't
help; POSIX stuff has never worked well under Windows. Actually, multithreaded Pari recompiled under Windows is slower than the single-threaded version.
So Pari will be used to get EXACT results for comparison, as a reference point.

Then I shall need WSL with gcc and clang/LLVM. Under Windows, I shall
use the Sage shell, which is a Cygwin (gcc) shell; I also have clang/LLVM installed,
and Visual Studio, or more precisely cl (the Microsoft C compiler), so we can compare
the WSL and Windows C compilers.

The HPC SDK is installed under WSL; for sure, we are going to use nvc++ with -acc (multicore and/or GPU).
It is not optimal for the GPU, but it works fine for the CPU.
Then we shall need Numba and PyCUDA. You can also compare with PyOpenCL,
although Nvidia with PyCUDA is better than Nvidia with PyOpenCL (yes, true, in caching
and memory management). Julia with Distributed can be used as well; it is fast, although slower than Numba. Plain Sage is very slow, even slower than plain Python,
but Cython is fast, as fast as gcc, although it works better in the complex case
than in the quaternionic case. With two loops Sage performs better than with four loops; gcc with four loops beats Cython here!!!

To make life easy I shall strip out most of the math; it can be found later in the paper
(under preparation). It might be helpful to know something about quaternionic structure in terms of 4x4 matrices. We shall see what math should be introduced along the way. We shall start with the easy complex case, the lemniscate: the complex zeta over the Gaussian integers.

I don't know what to do about equations; I should put them in as pictures, unless there is some way to LaTeX them.

It would be really nice to make nvc++ available for Windows, to compare it against
WSL or Linux. We shall launch multimillions of threads. I wish I could launch a trillion;
theoretically, CUDA supports about 2^73 threads in a single kernel launch. I am very far from this number.
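
For the record, here is where the 2^73 comes from, by my arithmetic from the published limits (a grid of (2^31 - 1) x 65535 x 65535 blocks at 1024 threads per block, for compute capability 3.0 and later):

from math import log2

# maximum grid dimensions times maximum threads per block
total = (2**31 - 1) * 65535 * 65535 * 1024
print(f"{total:.3e} threads per launch ~ 2^{log2(total):.1f}")   # ~ 2^73.0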

For those who might be interested, you will need some basic material about elliptic functions. Maybe

V. Prasolov & Y. Solovyev, Elliptic Functions and Elliptic Integrals, AMS Translations of Mathematical Monographs 170 (ams.org)

is enough for now. Also

A. Weil, Elliptic Functions according to Eisenstein and Kronecker (Springer)

is important, because Eisenstein's (notice the extra s) methods fit well with the GPU
and CUDA.

Chapter 4 of Prasolov & Solovyev is the lemniscate.

I shall write down the math and the programs (Mathematica, Pari, Numba, PyCUDA, C)
together; later I shall combine them and clean everything up into one LaTeX/PDF
file.

Maybe a bit of history might be relevant:

A. I. Markushevich, The Remarkable Sine Functions (Elsevier, 1966), on the lemniscate,

or chapter one of

A. I. Markushevich, Introduction to the Classical Theory of Abelian Functions, AMS Translations of Mathematical Monographs.

Let's code:
OK, I hope those who might be interested at least googled the lemniscate.
Now, Gauss solved the problem by inverting the lemniscate integral in terms
of the AGM. But let's do it via the zeta:
we shall need to evaluate a double sum over the Gaussian integers (= complex numbers with integer coordinates).

So we have something like (i is the imaginary unit)

(sum_{x=-inf}^{inf} sum_{y=-inf}^{inf})' 1/(x + i y)^4

where the dash on the double sum means the singularity at x = y = 0 is excluded.

Now, you can calculate by hand or use Mathematica to find the explicit form
of 1/(x + i y)^4 (I shall use Math: to indicate Mathematica input):

Math: ComplexExpand[1/(x + I*y)^4]

or better, separate the real and imaginary parts:

Math: ComplexExpand[Re[1/(x + I*y)^4]]
Math: ComplexExpand[Im[1/(x + I*y)^4]]

Clearly the imaginary part is skew-symmetric and gets canceled in the symmetric summation.

So we are left with the real part. Actually, it is a good test of complex-number
"packages" to do the imaginary part first: you will get "A WRONG" result,
a residual part due to floating-point "computer arithmetic". We shall give an example
for testing (later).
Now, if you have done the real-part calculation above, you will end up with

(x^4 - 6 x^2 y^2 + y^4) / norm^4

where norm = x^2 + y^2.

Hint, to do it by hand in case you don't have Mathematica:

1/(x + i y) = (x - i y) / (x^2 + y^2)

For sure you already know this, so you can see where the norm comes from; the real
part comes from expanding (x - i y)^4 = x^4 - 4 i x^3 y - 6 x^2 y^2 + 4 i x y^3 + y^4. That is an easy calculation for a first-year engineering or CS student.

So we have to sum the real part. Of course, we are working numerically, so
instead of -inf and inf we use big numbers; depending on the convergence rate and your
floating-point precision, there is an effective cutoff
beyond which you will not see any change.

Now, what does this mean in Mathematica? First,
a double sum in Mathematica is not Sum[Sum[…]];
it is directly Sum[…, {}, {}].
It should also be clear that the symmetric (dashed) sum reduces
to a sum over x = 1..nn, y = 1..nn. (Strictly, the full dashed sum is 4 times this
quadrant sum plus the axis terms; see the sanity check below. Since we compare the
same quadrant sum across all tools, the factor does not matter.)
So, easy: it is just two loops (in the computer sense).

Math: nn = 10; Sum[(x^4 - 6 x^2 y^2 + y^4)/(x^2 + y^2)^4, {x, 1, nn}, {y, 1, nn}]

You will get a big rational number, since Mathematica is exact. Now
we shall work numerically, with only 20 digits of precision (20 is more than C/CUDA
double or Numba double; you shall see in a moment CUDA vs. C vs. Numba
double precision).
So now:
Math: nn = 10; Sum[N[(x^4 - 6 x^2 y^2 + y^4)/(x^2 + y^2)^4, 20], {x, 1, nn}, {y, 1, nn}]
Output: -0.29498943955853630946
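
A quick sanity check in plain Python (mine, not part of the original session): the quadrant sum reproduces Mathematica's value, and the full dashed sum equals 4 times the quadrant sum plus the axis terms we dropped.

nn = 10
quad = sum((x**4 - 6*x**2*y**2 + y**4) / (x**2 + y**2)**4
           for x in range(1, nn+1) for y in range(1, nn+1))
full = sum((1 / complex(x, y)**4).real
           for x in range(-nn, nn+1)
           for y in range(-nn, nn+1) if (x, y) != (0, 0))
axes = 4 * sum(1.0 / n**4 for n in range(1, nn+1))   # x-axis and y-axis terms
print(quad)                    # -0.2949894395585363, matching the output above
print(full - (4*quad + axes))  # ~0.0, up to float roundoff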

Of course, nn = 10 is nothing, and we need timing:

Math: nn = 10;
Sum[N[(x^4 - 6 x^2 y^2 + y^4)/(x^2 + y^2)^4, 20], {x, 1, nn}, {y, 1, nn}] // AbsoluteTiming

Output: {0.000781, -0.29498943955853630946}

So Mathematica took 0.000781 secs.
If you try to push things bigger and bigger, it will take more and more time.
BUT WAIT, WHERE is the multithreading?
In Mathematica it is easy, Sum → ParallelSum, but
you also indicate how many kernels:
Math: LaunchKernels[16]
You launch just once (check the Task Manager).
Math: nn = 1000;
ParallelSum[N[(x^4 - 6 x^2 y^2 + y^4)/(x^2 + y^2)^4, 20], {x, 1, nn}, {y, 1, nn}] // AbsoluteTiming

Now the output for nn = 1000 from Mathematica 9 and Mathematica 13.1:
Output 13.1: {1.15557, -0.29452031608989259865}
Output 9: {0.734036, -0.29452031608989259865}
To be fair with Mathematica (I love it), I shall use Math 9.
Now, since it is not common knowledge: Pari is the official fast number cruncher.
You explicitly indicate how much precision you want, and then you work.
You can download it (for Windows, in my case) single-threaded.
In Pari, first turn on the timer:
(03:45) gp > default(timer,1)
then
(03:46) gp > sum(x=1,100,sum(y=1,100, (x^4 - 6*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
The output is a big rational number. Life is easy here: just replace 6 by 6. to force floating point.

(03:48) gp > sum(x=1,100,sum(y=1,100, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
time = 16 ms.
%3 = -0.29452815542685634146847746155357776004
You can get more precision if you like, but that is more than enough for our case,
and it makes it comparable with Mathematica:
(03:49) gp > \p 25
(03:50) gp > sum(x=1,100,sum(y=1,100, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
time = 15 ms.
%4 = -0.2945281554268563414684775
OK, let's do the 1000 case:
(03:50) gp > sum(x=1,1000,sum(y=1,1000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
time = 532 ms.
%5 = -0.2945203160898925986527097

ms = milliseconds, i.e. 532 ms = 0.532 secs.
Math 9 (16 threads) was 0.734036, while Math 13.1 (16 threads) was 1.15557,
on my 2020 laptop's Ryzen 7 4800H.
Of course, you have better machines.
Now, I compiled Pari from source over WSL with --tune and --mt=pthread.
This is what I shall use, the static version (gp-sta):

mabd@LAPTOP-T8DQ9UK0:~$ ./gp-sta
                                         GP/PARI CALCULATOR Version 2.13.1 (released)
                                 amd64 running linux (x86-64/GMP-6.2.1 kernel) 64-bit version
                            compiled: May 12 2022, gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
                                                   threading engine: pthread
                                        (readline v7.0 enabled, extended help enabled)

                                            Copyright (C) 2000-2020 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.

Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.

parisize = 8000000, primelimit = 500000, nbthreads = 16
? default(timer,1)
? \p 25
? parsum(x=1,1000,sum(y=1,1000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 984 ms, real time = 437 ms.
%2 = -0.2945203160898925986527097
? parsum(x=1,1000,parsum(y=1,1000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 1,009 ms, real time = 72 ms.
%3 = -0.2945203160898925986527097
?

You should use two parsums.
Now you can see the parallel version did the 1000 case in 72 ms = 0.072 sec.

So now, let's C.

Pari site (Bordeaux): https://pari.math.u-bordeaux.fr/download.html

OK, let's do Numba first, and I can show you some "floating-point anomaly":
two math expressions that are mathematically equivalent but not
numerically ("floating point") equivalent.

from numba import njit, prange, float64
from time import time as _time

@njit(parallel=True, fastmath=True, nogil=True)
def zt2(nn):
    sum = 0.
    for i1 in prange(1, nn+1):
        for i2 in prange(1, nn+1):
            kk1 = float64(i1*i1)
            kk2 = float64(i2*i2)
            sum += (kk1*kk1 - 6.*kk1*kk2 + kk2*kk2) / ((kk1+kk2)*(kk1+kk2)*(kk1+kk2)*(kk1+kk2))
    return sum

a=_time();zt2(1000);_time()-a
-0.29452031608989515
0.0
a=_time();zt2(10000);_time()-a
-0.29452023400565625
0.015344619750976562
a=_time();zt2(100000);_time()-a
-0.2945202332482132
1.097581148147583
a=_time();zt2(200000);_time()-a
-0.294520233189406
4.391449451446533
a=_time();zt2(300000);_time()-a
-0.2945202331733986
9.945176124572754
a=_time();zt2(500000);_time()-a
-0.29452023317280857
29.442336082458496

Only GOD knows how long Mathematica would take on the 500000 case. Numba can do it in 30 secs. We shall extrapolate later.

Now, something interesting:

@njit(parallel=True, fastmath=True, nogil=True)
def zt2m(nn):
    sum = 0.
    for i1 in prange(1, nn+1):
        for i2 in prange(1, nn+1):
            kk1 = float64(i1*i1)
            kk2 = float64(i2*i2)
            sum += ((kk1-kk2)*(kk1-kk2) - 4.*kk1*kk2) / (kk1+kk2)**4
    return sum

a=_time();zt2m(1000);_time()-a
-0.29452031608989065
0.0
a=_time();zt2(1000);_time()-a
-0.29452031608989515
0.0
a=_time();zt2m(10000);_time()-a
-0.29452023400614413
0.015372991561889648
a=_time();zt2(10000);_time()-a
-0.29452023400565625
0.015612602233886719
a=_time();zt2m(100000);_time()-a
-0.2945202337184758
1.5366814136505127
a=_time();zt2(100000);_time()-a
-0.2945202332482132
1.0973074436187744
a=_time();zt2m(500000);_time()-a
-0.29452023317285664
39.00039505958557
a=_time();zt2(500000);_time()-a
-0.29452023317280857
27.96574878692627

Now you see: they are different. ALWAYS RECHECK YOURSELF.
I want to sleep… c u

Dear moderator, if you intend to delete this topic, PLEASE warn me first.
I don't have any copy of it.

I shall show you the complex case later; we shall have more "floating-point residual effects".
Actually, try your favorite C compiler, gcc / clang-LLVM / cl-MSVC,
and just compute 5^27 with integers and with doubles. What do you get?!!
Marvelous: every idiot knows that 5^n ends with a five, except C compilers.
Actually, not just in C but also in Python, and maybe everywhere. I found this while
testing AMD's advanced fast-math library, but since then I have found it to be universal.
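
To make the anomaly concrete, here is a minimal sketch in Python (my example): 5^27 needs 63 significant bits, a double has only 53, so the rounded value no longer ends in 5.

print(5**27)             # 7450580596923828125: exact integer, ends in 5
print(f'{5.0**27:.0f}')  # 7450580596923828224 on IEEE-754 hardware: the
                         # nearest double, off by 99 and ending in 4

The same thing happens in C with long long versus double and pow(5, 27).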

Now, we should decide which formula to use; things get nastier with quaternions and the higher-dimensional cases: octonions, Lie algebra representations, Kac-Moody, etc.

Here, we can judge which formula is correct using multithreaded Pari; I am using it under WSL. Be patient with me: the story is long, and I am already making a lot of cuts.
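
An alternative referee that needs no extra software (my suggestion, not part of the original recipe): Python's fractions module sums the terms in exact rational arithmetic, rounding only once at the end. Keep nn small; exact arithmetic is expensive.

from fractions import Fraction

def zt2_exact(nn):
    s = Fraction(0)
    for x in range(1, nn + 1):
        for y in range(1, nn + 1):
            s += Fraction(x**4 - 6*x**2*y**2 + y**4, (x**2 + y**2)**4)
    return float(s)  # the exact sum, rounded to a double once at the end

print(zt2_exact(20))  # compare against the two float formulas at the same nn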

Sorry for the infinity of typos, starting from the title. Knowing that C is C,
and finding that C compilers are not perfect, do not expect (I wrote "except" at first) anything better from mortals.

OK, it is straightforward now to write the C code. I shall work on WSL;
I have tested under both Windows and WSL.
My compiler versions are:
1: WSL Ubuntu 18.04
a) gcc
mabd@LAPTOP-T8DQ9UK0:~$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: …/src/configure -v --with-pkgversion=‘Ubuntu 7.5.0-3ubuntu1~18.04’ --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

b)
mabd@LAPTOP-T8DQ9UK0:~$ clang -v
clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/…/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/bin/…/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/bin/…/lib/gcc/x86_64-linux-gnu/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/bin/…/lib/gcc/x86_64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda, version 7.0

after exporting the HPC SDK PATH:

mabd@LAPTOP-T8DQ9UK0:~$ export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/21.3/compilers/bin:$PATH
mabd@LAPTOP-T8DQ9UK0:~$ export PATH=/opt/nvidia//hpc_sdk/Linux_x86_64/21.3/cuda/11.2/bin:$PATH

c)
mabd@LAPTOP-T8DQ9UK0:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:32:09_PST_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0

d)
mabd@LAPTOP-T8DQ9UK0:~$ nvc++ -V
nvc++ 21.3-0 LLVM 64-bit target on x86-64 Linux -tp zen
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

When needed, I shall tell you about my Windows setup,
but we are not here to test/compare compilers.

So, let's C, and then use OpenACC (HPC SDK under WSL) to generate the multithreaded CPU/GPU versions.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        double zeta = 0, kk1, kk2;
        long int i, j, N;

        if (argc <= 1)
        {
                printf("Usage: %s <number> \n", argv[0]);
                return 2;
        }
        N = atol(argv[1]);
        for (i = 1; i <= N; i++)
        {
                for (j = 1; j <= N; j++)
                {
                        kk1 = i*i; kk2 = j*j;
                        zeta += (kk1*kk1 - 6.*kk1*kk2 + kk2*kk2) / ((kk1+kk2)*(kk1+kk2)*(kk1+kk2)*(kk1+kk2));
                }
        }
        printf("my zeta %ld = %.25f \n", N, zeta);
        return 0;
}

Nothing fancy, the same code as in Numba; just change the compiler and
compile for speed:

gcc -Ofast zetaC.c -oz2_gcc

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./z2_gcc 1000
my zeta 1000 = -0.2945203160898910987874899

real 0m0.002s
user 0m0.002s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

C is C

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./z2_gcc 10000
my zeta 10000 = -0.2945202341058930084471967

real 0m0.121s
user 0m0.121s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

Timing is nice; things get slower when you shoot higher:

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./z2_gcc 100000
my zeta 100000 = -0.2945202331726006761414283

real 0m11.752s
user 0m11.752s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

You should compare it with Pari, and remember this is just single-threaded C.

Now, in every sense of the word, thanks to WSL/MS and nvc++/Nvidia,
you have opened a new world for us. But this is just the complex case; I am not surprised.
The true gain comes later. Let me do the nvc++ part in a separate reply; it deserves it.
You don't calculate one trillion terms on a laptop in a throwaway reply.
We should show respect.

First and foremost:
I can still do better, but it is beyond basic computing skills.
It is the math behind it. I shall call this

  • The Eternal Parallel Computing Theorem -
    Theorem: Parallel loops admit a symmetric group action.

Parallelizable loops, by definition, should have no preferred order, but that is nothing other than a symmetric group action.

We shall see it later. Initially I wanted to do the math first, but from my experience
in recent years with engineering and computing students, whenever I try to explain the math
they have one single nagging request: give me an example. Later we shall stop and summarize everything, after we finish coding the complex case.

OK. We have two OpenACC pragmas, and we recompile twice with nvc++: once
for the GPU and once for multicore.

Now, let's start with the (1000000)^2 = 10^12 case.
Actually, we can do it with Numba.
But first, let's generate some reference points with Pari: 1000, 10000, 100000.

? parsum(x=1,1000,parsum(y=1,1000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 1,028 ms, real time = 75 ms.
%6 = -0.29452031608989259865270968298593082005
?
? parsum(x=1,10000,parsum(y=1,10000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 1min, 45,662 ms, real time = 7,302 ms.
%7 = -0.29452023400558052987449790196684298567
?
?
?
? parsum(x=1,100000,parsum(y=1,100000, (x^4 - 6.*x^2 * y^2 + y^4)/(x^2 + y^2)^4))
cpu time = 3h, 26min, 49,217 ms, real time = 13min, 4,862 ms.
%8 = -0.29452023318099672363387284014568673756
?

The last case, 100000, took 13 min and around 5 sec.

with Numba

a=_time();zt2(1000);_time()-a
-0.29452031608989515
0.0
a=_time();zt2(10000);_time()-a
-0.29452023400565625
0.0
a=_time();zt2(100000);_time()-a
-0.2945202332482132
1.1132497787475586

and here it is

a=_time();zt2(1000000);_time()-a
-0.2945202331725871
112.3285620212555
Not bad: less than 2 min, just 112 sec. But we shall do much better when the symmetric group, plus a hidden Diophantine relation promoted to a symmetry, is taken into account, as the Golden Rule says:
— Don't Calculate Something Twice —
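
Here is what the swap symmetry alone already buys, in Numba (a sketch of mine, same flags as zt2 above): the summand is invariant under exchanging x and y, so it is enough to sum the strict upper triangle once, double it, and add the diagonal, roughly halving the work.

from numba import njit, prange, float64
from time import time as _time

@njit(parallel=True, fastmath=True, nogil=True)
def zt2s(nn):
    up = 0.    # strict upper triangle i2 > i1, counted once
    diag = 0.  # i1 == i2 terms: (x^4 - 6x^4 + x^4)/(2x^2)^4 = -1/(4 x^4)
    for i1 in prange(1, nn+1):
        kk1 = float64(i1*i1)
        diag += -4.*kk1*kk1 / ((2.*kk1)*(2.*kk1)*(2.*kk1)*(2.*kk1))
        for i2 in range(i1+1, nn+1):
            kk2 = float64(i2*i2)
            s = kk1 + kk2
            up += ((kk1-kk2)*(kk1-kk2) - 4.*kk1*kk2) / (s*s*s*s)
    return 2.*up + diag

t = _time(); v = zt2s(1000000); print(v, _time() - t)  # near half of zt2's time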

Now, let's OpenACC.

*** PLEASE, IF YOU HAVE A BETTER SUGGESTION, TEACH ME ***

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

int main(int argc, char *argv[])
{
        double zeta = 0, kk1, kk2;
        long int i, j, N;

        if (argc <= 1)
        {
                printf("Usage: %s <number> \n", argv[0]);
                return 2;
        }
        N = atol(argv[1]);
        /*acc_init( acc_device_nvidia );*/
        #pragma acc kernels
        #pragma acc loop
        for (i = 1; i <= N; i++)
        {
                #pragma acc loop
                for (j = 1; j <= N; j++)
                {
                        kk1 = double(i*i); kk2 = double(j*j);
                        zeta += ((kk1-kk2)*(kk1-kk2) - 4.*kk1*kk2) / ((kk1+kk2)*(kk1+kk2)*(kk1+kk2)*(kk1+kk2));
                }
        }
        printf("my zeta %ld = %.25f \n", N, zeta);
        return 0;
}

I know, for the GPU this is usually not optimal. I don't know: for every different range,
should I specify gang, worker, and vector each time?!

nvc++ -acc=multicore -Minfo=accel -I include/ zetacacc.c -O4 -ozetacacc

I also tried -fast.

It is OK; it is very comparable with Numba, and, most importantly,
I have something to compare with Numba at higher values, where Pari
takes too long to compute (at least on a laptop):

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 1000
my zeta 1000 = -0.2945203160898906546982801

real 0m0.006s
user 0m0.000s
sys 0m0.021s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 10000
my zeta 10000 = -0.2945202340061441326213298

real 0m0.025s
user 0m0.313s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 100000
my zeta 100000 = -0.2945202337184757990229400

real 0m1.572s
user 0m25.074s
sys 0m0.000s

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 1000000
my zeta 1000000 = -0.2945202331726640698761344

real 2m43.937s
user 43m41.286s
sys 0m1.380s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

We need to do better, because the true cases, the higher-dimensional ones,
are numerically very expensive.

Compiling for the GPU:

mabd@LAPTOP-T8DQ9UK0:~/zeta$ nvc++ -acc -Minfo=accel -cuda -I include/ zetacacc.c -O4 -ozetacaccg
main:
     28, Loop is parallelizable
         Generating implicit copy(zeta) [if not already present]
     31, Loop is parallelizable
         Generating Tesla code
         28, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(+:zeta)
         31,   /* blockIdx.x threadIdx.x auto-collapsed */
mabd@LAPTOP-T8DQ9UK0:~/zeta$
I tried to change from tesla to cc75, but I failed. I am using:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ nvc++ -acc -gpu=cc75 -Minfo=accel -cuda -I include/ zetacacc.c -fast -ozetacaccg
nvc++-Error-CUDA version 10.2 is not available in this installation.
nvc++-Error-CUDA version 10.2 is not available in this installation.

even though I am using HPC 21.3:
mabd@LAPTOP-T8DQ9UK0:~/zeta$ /opt/nvidia/hpc_sdk/Linux_x86_64/21.3/

The GPU case is slower, so I shall write a PyCUDA raw kernel.
I don't have any profiler here; nvprof does not work in WSL.
Please guys, if you have even a pre-release version, alpha, beta, anything:
it is OK for me, I can test it for you.

Here are the results for the GPU:

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 1000
my zeta 1000 = -0.2945203160898925975885732

real 0m0.496s
user 0m0.022s
sys 0m0.098s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 10000
my zeta 10000 = -0.2945202340055809719920887

real 0m0.538s
user 0m0.040s
sys 0m0.119s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 100000
my zeta 100000 = -0.2945202331809971263432146

real 0m3.054s
user 0m2.532s
sys 0m0.101s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 1000000
my zeta 1000000 = -0.2945202331727466704691665

real 4m14.078s
user 4m13.702s
sys 0m0.090s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

It is clear that there is extra overhead involved. But we have generated new results:
Numba, OpenACC (multicore), OpenACC (GPU).

Studying the symmetry involved in multi-loops would be extremely useful.
It is strange that it has not been done. It could be implemented in
compilers.
Here, in the 2d case, it is easy TO SEE; the higher-dimensional
case involves some imagination.
Luckily for me, I worked with GR and Kaluza-Klein before doing
my thesis in higher-d (>2d) integrable models. The generic case involves
higher-rank tensors, but the symmetric group action can easily be calculated. Getting
the additional part is problem-dependent, not generic.

This is the result of GPU-Z for the GPU 1000000 case, set to show the highest value.
As shown, the GPU load is 100% but the memory used is negligible: it didn't generate a lot
of data, but worked "CPU-like".

I know we can do better, even with PyCUDA.
The idea is simple: generate our zeta as a big, very big array, and sum in parallel.
It is not only the summing in parallel; more important is generating the zeta
elements in parallel. Then we will see how symmetry can help us enormously.
So, here first is raw CUDA, faster than OpenACC.

Let's blow up the GPU.
Here is a piece of code stolen from some PyCUDA example
and just modified, for two important purposes:
1- measure kernel time; 2- see floating-point precision.
I don't like this code, but I shall modify it later.

import pycuda.driver as cuda
import pycuda.autoinit  # noqa
from pycuda.compiler import SourceModule
from time import time as _time
import numpy
nn = 32
gd = 836    # earlier runs used 496-512-640
dd = 32*gd
a = numpy.empty(dd*dd).reshape(dd, dd)
a = a.astype(numpy.float64)
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)  # uploads uninitialized data; strictly unnecessary

mod = SourceModule("""
    __global__ void zetac(double *a)
    {
      int id = blockIdx.x*32*32 + threadIdx.y*32 + threadIdx.x;
      int y = id/(32*836) + 1;   /* row width 32*gd: must match gd above */
      int x = id%(32*836) + 1;
      double kk1 = double(x)*double(x);
      double kk2 = double(y)*double(y);
      a[id] = ((kk1-kk2)*(kk1-kk2)-4.*kk1*kk2) / (double(x*x+y*y)*(x*x+y*y)*(x*x+y*y)*(x*x+y*y));
    }
    """)
tt1=_time()
func = mod.get_function("zetac")
func(a_gpu,block=(nn, nn, 1),grid=( gd*gd, 1, 1), shared=0)
tt2=_time()
print("func gpu time is =",tt2-tt1)
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
tt3=_time()
print("memcpy time is =",tt3-tt2)
print("last element =",a_doubled[nn*gd-1,nn*gd-1])
print("dim =",a_doubled.shape)
tt4=_time()
print("sum =",a_doubled.sum())
tt5=_time()
print("sum time over the cpu  =  ",tt5-tt4)

Nothing fancy, a standard PyCUDA example.

when I run it under Windows

(base) D:.Nvidia>python zetac_cuda.py
func gpu time is = 0.0
memcpy time is = 1.076991081237793
last element = 2.406572890177724e-19
dim = (26752, 26752)
sum = -0.29452023321826115
sum time over the cpu = 0.8240005970001221

(base) D:.Nvidia>

I summed on the CPU, not the GPU, but I generated the lemniscate terms on the GPU first.

When I got

func gpu time is = 0.0

I was puzzled: either it is small, too small, like sub-millisecond, or it is buggy.
Since I have a profiler under Windows:
(base) D:.Nvidia>nvprof python zetac_cuda.py
==5440== NVPROF is profiling process 5440, command: python zetac_cuda.py
func gpu time is = 0.0009996891021728516
memcpy time is = 1.078988790512085
last element = 2.406572890177724e-19
dim = (26752, 26752)
sum = -0.29452023321826115
sum time over the cpu = 0.8040022850036621
==5440== Profiling application: python zetac_cuda.py
==5440== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 46.26% 901.55ms 1 901.55ms 901.55ms 901.55ms [CUDA memcpy DtoH]
44.64% 869.79ms 1 869.79ms 869.79ms 869.79ms [CUDA memcpy HtoD]
9.10% 177.33ms 1 177.33ms 177.33ms 177.33ms zetac
API calls: 45.19% 1.07907s 1 1.07907s 1.07907s 1.07907s cuMemcpyDtoH
44.24% 1.05619s 1 1.05619s 1.05619s 1.05619s cuMemcpyHtoD
6.66% 158.99ms 1 158.99ms 158.99ms 158.99ms cuCtxCreate
1.82% 43.432ms 1 43.432ms 43.432ms 43.432ms cuCtxDetach
1.58% 37.707ms 1 37.707ms 37.707ms 37.707ms cuMemFree
0.50% 11.875ms 1 11.875ms 11.875ms 11.875ms cuMemAlloc
0.01% 199.90us 1 199.90us 199.90us 199.90us cuModuleLoadDataEx
0.00% 74.500us 1 74.500us 74.500us 74.500us cuLaunchKernel
0.00% 66.200us 1 66.200us 66.200us 66.200us cuModuleUnload
0.00% 24.500us 3 8.1660us 900ns 12.400us cuCtxPopCurrent
0.00% 5.5000us 3 1.8330us 300ns 4.1000us cuCtxPushCurrent
0.00% 5.3000us 3 1.7660us 200ns 3.7000us cuDeviceGetCount
0.00% 4.7000us 2 2.3500us 400ns 4.3000us cuDeviceGet
0.00% 3.5000us 1 3.5000us 3.5000us 3.5000us cuModuleGetFunction
0.00% 2.8000us 2 1.4000us 500ns 2.3000us cuCtxGetDevice
0.00% 2.5000us 1 2.5000us 2.5000us 2.5000us cuFuncSetBlockShape
0.00% 1.4000us 3 466ns 200ns 900ns cuDeviceGetAttribute
0.00% 1.1000us 1 1.1000us 1.1000us 1.1000us cuDeviceComputeCapability

(base) D:.Nvidia>

                0.00%  74.500us         1  74.500us  74.500us  74.500us  cuLaunchKernel

That is the first thing I needed from this ugly code.
The second job is the value of the sum over the CPU. It is:
sum = -0.29452023321826115 — this is (II-Py)
sum time over the cpu = 0.8159992694854736
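
Instead of copying the whole 5.7 GB array back and summing on the CPU, PyCUDA's gpuarray module can keep the array on the device and reduce it there. A sketch of mine (same kernel idea, with the row width passed in as an argument; pycuda.gpuarray.sum does a parallel reduction on the GPU):

import numpy
import pycuda.autoinit  # noqa
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

nn, gd = 32, 836
dd = nn * gd

mod = SourceModule("""
    __global__ void zetac(double *a, int dd)
    {
      long id = blockIdx.x*1024L + threadIdx.y*32 + threadIdx.x;
      double x = double(id % dd + 1), y = double(id / dd + 1);
      double kk1 = x*x, kk2 = y*y, nrm = kk1 + kk2;
      a[id] = ((kk1-kk2)*(kk1-kk2) - 4.*kk1*kk2) / (nrm*nrm*nrm*nrm);
    }
    """)

a_gpu = gpuarray.empty((dd, dd), numpy.float64)      # never leaves the device
mod.get_function("zetac")(a_gpu, numpy.int32(dd),
                          block=(nn, nn, 1), grid=(gd*gd, 1, 1))
print(gpuarray.sum(a_gpu).get())                     # reduction on the GPU

This removes both memcpy lines from the profile; only the final scalar comes back.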

OK, every time I gave a Pari value, it was for comparison, even when not explicitly mentioned.
Now, let's make it explicit:

? default(timer,1)
? \p 25
? parsum(x=1,26752,parsum(y=1,26752, (x^4 - 6. *x^2 * y^2 + y^4) / (x^2 + y^2)^4 ))
cpu time = 12min, 46,118 ms, real time = 49,809 ms.
%3 = -0.2945202332890831296994150

It took 49.8 sec to give me (I-P) -0.2945202332890831296994150.

We have Numba, ACC GPU, and ACC CPU.
Numba:

from numba import njit, prange, float64
from time import time as _time

@njit(parallel=True, fastmath=True, nogil=True)
def zt2(nn):
    sum = 0.
    for i1 in prange(1, nn+1):
        for i2 in prange(1, nn+1):
            kk1 = float64(i1 * i1)
            kk2 = float64(i2 * i2)
            sum += ((kk1-kk2) * (kk1-kk2) - 4. * kk1 * kk2) / ((kk1+kk2) * (kk1+kk2) * (kk1+kk2) * (kk1+kk2))
    return sum

a=_time();zt2(26752);_time()-a
-0.29452023330627203
0.07782530784606934

Extremely impressive timing, but we need the value:

(III-N) -0.29452023330627203

Actually, pure single-threaded C, gcc and clang/LLVM, can also be included:

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./z2_cl 26752
my zeta 26752 = -0.2945202331726006761414283

real 0m0.850s
user 0m0.850s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./z2_gcc 26752
my zeta 26752 = -0.2945202331726006761414283

real 0m0.851s
user 0m0.851s
sys 0m0.000s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

(IV-C) -0.2945202331726006761414283

ACC/CPU
mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacacc 26752
my zeta 26752 = -0.2945202334266599497603067

real 0m0.120s
user 0m1.805s
sys 0m0.010s

(V-ACCCPU) -0.2945202334266599497603067

ACC/GPU

mabd@LAPTOP-T8DQ9UK0:~/zeta$ time ./zetacaccg 26752
my zeta 26752 = -0.2945202332890828333944455

real 0m0.779s
user 0m0.230s
sys 0m0.182s
mabd@LAPTOP-T8DQ9UK0:~/zeta$

Not the best timing.
(VI-ACCGPU) -0.2945202332890828333944455

Actually, we could do it with Mathematica too, but the value would equal Pari's.

Let's compare. I shall include the time between << … >> in secs:

(I-P) -0.2945202332890831296994150 << 49.8 >> <<<<<----------------
(II-Py) -0.29452023321826115 << 0.177 kernel + 1.08 copy + 0.80 cpu sum >>
(III-N) -0.29452023330627203 << 0.07782530784606934 >>
(IV-C) -0.2945202331726006761414283 << 0.85 >>
(V-ACCCPU) -0.2945202334266599497603067 << 0.12 >>
(VI-ACCGPU) -0.2945202332890828333944455 << 0.779 >> <<<<----------

NOW, we can recognize the true from the false WINNER.
But we were not fair with II-Py.

By the way, I have also done it with Sage and Julia, and of course Mathematica and PyOpenCL.

To be seen.
It is clear that CUDA deserves better attention from mathematicians, and CS people
should do better and make it easier.
OK, we shall continue.

I am extremely sorry; I hadn't realized that the forum editor
changes some of my code and removes some "stars" (asterisks). That is too bad.
OK, later I shall re-upload these little programs.
Please, reconsider a LaTeX editor for this forum.

OK, to understand and better appreciate CUDA, let's flatten our zeta.
I always believed that multithreading could do more, but it wasn't an easy ride at all.

I started playing seriously with multithreading around 2003. It gives you the impression
that you are doing everything "by hand".

Then Maeder offered his Parallel kit with Mathematica 7, and it became
even better with 8 and 9. I used it almost everywhere; it is great and has made my life
easier. But with the new 4-core/8-thread laptops, 2012+, I started to wonder:
shouldn't we change our sequential algorithms, not just translate them? And this is really the point where CUDA shines.

Doing the sum is no big deal; generating the elements is the true problem.
Let's see what happens.
Instead of using Sum or ParallelSum in Mathematica, we shall use Table or ParallelTable.
I am sorry, guys, I cannot explain group theory here; you should do it on your own. There is an infinity
of books about group theory and symmetric groups. My favourite one is

P.-L. Méliot, Representation Theory of Symmetric Groups, CRC Press series on discrete mathematics and its applications, Chapman and Hall/CRC (2017),

but that might be too advanced for you. I shall do it now using Mathematica:
we shall find the group action, compute the orbits, and see the redundancy (a small sketch of the idea follows below).
I told you I love Mathematica.
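
He will do the orbits in Mathematica; meanwhile, here is a plain-Python sketch of what "group action, orbits, redundancy" means for our two loops (my illustration): the swap (x, y) -> (y, x) partitions the nn^2 lattice points into nn(nn+1)/2 orbits, and summing one representative per orbit, weighted by its orbit size, gives the same zeta.

from itertools import product

nn = 4
pts = list(product(range(1, nn+1), repeat=2))

orbits = {}                      # orbit = {(x, y), (y, x)}
for (x, y) in pts:
    orbits.setdefault(frozenset({(x, y), (y, x)}), []).append((x, y))

def term(x, y):
    return (x**4 - 6*x**2*y**2 + y**4) / (x**2 + y**2)**4

s_full = sum(term(x, y) for (x, y) in pts)
s_orb = sum(len(m) * term(*m[0]) for m in orbits.values())

print(len(pts), 'points,', len(orbits), 'orbits')   # 16 points, 10 orbits
print(s_full - s_orb)                               # ~0.0, up to float roundoff

The redundancy, (nn^2 - nn)/2 points, is exactly what the zt2s sketch above skipped.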