I first tried to approach CUDA from the Python side because of Sage. I work on integrable models:
we solve systems of non-linear PDEs using different methods, algebraic, geometric, topological.
I first tried the Mathematica CUDA package, but it was written more than 10 years ago
and hasn't evolved much since then. A BIG mistake by WRI.
So I played with the Python ecosystem in order "to couple" with Sage and Cython.
Numba has limited CUDA functionality, and CuPy is more NumPy-oriented. Both are nice
within their field of application and make life easier.
PyCUDA and PyOpenCL were a different story for me: having a Ryzen 7, it is attractive
to write code once and run it simultaneously on the built-in AMD GPU and the Nvidia 1660 Ti
(boosted to 1850 MHz).
That said, here is my CUDA code, under Windows and WSL. They are the same,
but Windows can go a little bit further.
The idea is easy:
1- Create a big zero array
2- Send it to the GPU
3- The CUDA kernel goes over it all at once and changes the zeros to ones
4- Sum it on the GPU and on the CPU
5- Time everything: kernel launch, copy, sum
I thought creating a zero matrix would use zero RAM, but it seems that CUDA
doesn't like dynamic allocation. That is fair enough: a matrix is a matrix,
whether zero or not.
(BUT AT LEAST SYMMETRIC or SKEW-SYMMETRIC STORAGE SHOULD BE ALLOWED)
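The point that zeros still cost full memory can be checked on the host side with plain NumPy (a sketch; a dense buffer on the GPU behaves the same way once allocated):

```python
import numpy as np

n = 10**6
zeros = np.zeros(n, dtype=np.float64)
ones = np.ones(n, dtype=np.float64)

# A dense array of zeros reserves exactly as many bytes as any other
# dense array of the same shape and dtype: 8 bytes per float64
# element here, zero or not. Symmetric or skew-symmetric storage
# would roughly halve this, but dense formats don't exploit it.
print(zeros.nbytes, ones.nbytes)   # both 8000000
assert zeros.nbytes == ones.nbytes == 8 * n
```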
PyCUDA creates a big cache. When I create a 5(+) GB array, Task Manager shows
PyCUDA and PyOpenCL using more than 15 GB of RAM, and the run sometimes even
fails. I have 16 GB of RAM, and since I have a fast SSD I increased my
pagefile to 8 GB, then 16 GB.
Here is the first code: