I first tried to approach CUDA from the Python side because of Sage. I work on integrable models.

We solve systems of non-linear PDEs using different methods: algebraic, geometric, topological.

I first tried the Mathematica CUDA package, but it was written more than 10 years ago

and hasn't evolved much since then. A BIG mistake by WRI.

So I played with the Python stack in order "to couple" it with Sage and Cython.

Numba has limited CUDA functionality. CuPy is more NumPy-oriented. They are nice

within their field of application and make life easier.

PyCUDA and PyOpenCL were a different story for me: having a Ryzen 7, it is attractive

to write code once and run it simultaneously on the built-in AMD GPU and an Nvidia 1660 Ti

(boosted to 1850 MHz).

That said, here is my CUDA code, under Windows and WSL. The two versions are the same,

but Windows can go a little bit further.

The idea is easy:

1- Create a big zero array

2- Send it to the GPU

3- The CUDA kernel goes over it all at once and changes the zeros to ones

4- Sum it on the GPU and on the CPU

5- Time everything: kernel launch, copy, sum
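The five steps above can be sketched on the CPU first with plain NumPy (a minimal reference baseline for the GPU timings; the array size and dtype here are my own assumptions, not the values from the actual runs):

```python
import time
import numpy as np

N = 10**7  # hypothetical size; the real runs use multi-GB arrays

# 1- create a big zero array
t0 = time.perf_counter()
a = np.zeros(N, dtype=np.float32)
t_alloc = time.perf_counter() - t0

# 3- flip every element to one (on the GPU, the CUDA kernel does this step)
t0 = time.perf_counter()
a[:] = 1.0
t_fill = time.perf_counter() - t0

# 4- sum it (CPU side; the GPU sum is checked against this value)
t0 = time.perf_counter()
total = a.sum()
t_sum = time.perf_counter() - t0

# 5- time everything
print(f"alloc {t_alloc:.4f}s  fill {t_fill:.4f}s  sum {t_sum:.4f}s  total={total:.0f}")
```

In the GPU version, steps 2 and 3 replace the fill with a host-to-device copy plus a kernel launch, and the timings of the two paths are compared.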

I thought creating a zero matrix would use zero RAM. But it seems that CUDA

doesn't like dynamic allocation. Fair enough: a matrix is a matrix,

whether zero or not.

**(BUT AT LEAST SYMMETRIC or SKEW-SYMMETRIC STORAGE SHOULD BE ALLOWED)**
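On that allocation point: a dense array of N float32 elements occupies 4·N bytes no matter what it contains, which a quick NumPy check confirms:

```python
import numpy as np

z = np.zeros((1000, 1000), dtype=np.float32)  # a million zeros
o = np.ones((1000, 1000), dtype=np.float32)   # a million ones

# Both dense arrays take exactly 4 bytes per element, regardless of
# content: no compression for zeros, and no half-storage option for
# symmetric or skew-symmetric matrices.
print(z.nbytes, o.nbytes)  # 4000000 4000000
```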

PyCUDA creates a big cache. If I create a 5(+) GB array, the Task Manager

shows PyCUDA and PyOpenCL using more than 15 GB of RAM,

and sometimes the run even fails. I have 16 GB of RAM, and having a fast SSD, I increased my

pagefile to 8 GB, then 16 GB.

Here is the first code: