Dask is another big winner for WSL

k.glimps · May 4, 2024, 9:53am

Maybe it is time for someone to write a book about CUDA in mathematics and physics.
I have been teaching physics and mathematics for the last 15+ years in CS and/or engineering.

At the start, I found learning CUDA confusing. A lot of material has
accumulated over the years, and Nvidia has changed its mind (in a good way) along the way,
from pure C CUDA to Python CUDA. And, as is well known, Python already has a very
big community. Moreover, at the start, WSL was confusing too: I had to register in the
Windows Insider program and deal with the old driver problems. After that, I found learning CUDA
very chaotic: different approaches, each one claiming to be the best. Doing it in pure
C is a step backward: no garbage collection, no dynamic arrays, and all the old problems
that were “solved” by Python.

Now, I understand Nvidia is investing a lot of money in AI. Not bad: tensors and linear algebra
are extremely useful everywhere.
I found nvc++ is a big winner for WSL. To be honest, I would like to see Cython as well.
How can I mix nvc++ and Cython together in an effective way? How can I pass nvc++ decorators
to accelerate Cython?

Numba is doing well at making CUDA kernels in Python, which is very good. Writing big CUDA
kernels in C is not effective, and debugging is problematic.

It was impressive to see how easy it is to work with big arrays (bigger than the RAM)
with Dask. I am expecting more from cuDF. It would be nice to have a heterogeneous array
defined over the CPU and the GPU, accelerated separately on the CPU and the GPU and then
fused together using sum/reduce.
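
The idea in that wish (split the data into chunks, process each chunk wherever it fits, then fuse the partial results with a sum/reduce) can be sketched in plain Python. This is only an illustration of the pattern, not the Dask or cuDF API; the function names are made up for the example, and each chunk could in principle be handed to a CPU or a GPU worker.

```python
from functools import reduce
from operator import add


def chunk_ranges(n_items, chunk_size):
    """Yield (start, stop) index pairs that together cover range(n_items)."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


def partial_sum_of_squares(start, stop):
    """Process one chunk only; the full data set is never held in memory."""
    return sum(i * i for i in range(start, stop))


def chunked_sum_of_squares(n_items, chunk_size):
    """Map over the chunks, then fuse the partial results with a reduce."""
    partials = (
        partial_sum_of_squares(start, stop)
        for start, stop in chunk_ranges(n_items, chunk_size)
    )
    return reduce(add, partials, 0)


total = chunked_sum_of_squares(1_000_000, 100_000)
print(total)
```

Because the partial sums are independent, the order and the place where each chunk is computed do not matter; only the final reduce needs all of them.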

I usually find that your mini-courses of 4 lectures, with slides (PDF) and notebooks on GitHub, are the
best way to learn.

It is extremely useful for all of us to hear from Nvidia about what is better. After all, you are
spending 100% of your time doing CUDA, so you have solved more problems.

But scientific applications are still lagging behind. It would be nice to see Cython or even Sage
with CUDA. The parallel decorator in Sage is useful and quite effective. I have used it for some gigantic
g=7 Riemann surface (RS) calculations. RS functions are the big sisters of sin/cos, going from simply periodic to
multiply periodic; the calculations are very extensive and badly needed. You cannot solve the simple
pendulum with sin/cos alone, you need doubly periodic functions, not to mention the spinning top and other
applications in 3 and higher dimensions, such as solitons.

Now, I am spending my spare time with Dask, CuPy, Numba, and cuDF.

Thank you for all your hard work, and keep going. We always need to hear
your point of view and your perspective for a better future of CUDA.

k.glimps · May 6, 2024, 9:32am

Now, I used Dask+Numba to process a 1T array in 25 sec. Amazing.
Numba kernels alone do it in 27 sec and come second.
Some other GPU solutions can take 4 min, which is still faster than any multicore CPU.

Amazing, on a Ryzen 7 4800 + mobile GTX 1660 Ti (6 GB) Nvidia GPU and just 16 GB of RAM for the CPU.

Wow!!!

It is amazing to see Dask working.

So, by now: EITHER

A- the c-way

You can write your own CUDA C when “it is possible”, i.e. it is hard for big kernels and complicated jobs.
Or use nvc++ GPU acceleration: it is easier but not optimal, and it needs hand-tuning.

B- the easier way

1- Use Numba kernels; they are easier than CUDA C kernels.
2- Use Numba kernels + (Distribute&Compute) with Dask.

B2<<<<< is fast and much simpler <<<<<<
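
For option B1, a minimal sketch of what a Numba CUDA kernel looks like is given here. It is not the code from any course or notebook; `scale`, `blocks_for`, and `run_on_gpu` are names invented for this example, and running it requires numba plus an Nvidia GPU, which is why the numba import is kept inside the functions.

```python
import math


def blocks_for(n_items, threads_per_block=256):
    """Number of blocks needed so that every element gets its own thread."""
    return math.ceil(n_items / threads_per_block)


def build_scale_kernel():
    """Compile and return a small CUDA kernel (needs numba and an Nvidia GPU)."""
    from numba import cuda  # lazy import: the file still loads without a GPU

    @cuda.jit
    def scale(arr, factor):
        i = cuda.grid(1)      # absolute index of this thread in a 1D grid
        if i < arr.size:      # threads past the end of the array do nothing
            arr[i] *= factor

    return scale


def run_on_gpu(host_array, factor):
    """Copy a NumPy array to the device, scale it there, and copy it back."""
    from numba import cuda

    scale = build_scale_kernel()
    d_arr = cuda.to_device(host_array)
    threads = 256
    scale[blocks_for(host_array.size, threads), threads](d_arr, factor)
    return d_arr.copy_to_host()
```

On a machine with a GPU, `run_on_gpu(np.ones(1000, dtype=np.float32), 3.0)` would be a typical call; the launch configuration in square brackets is the part that replaces the `<<<blocks, threads>>>` syntax of CUDA C.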

Thank you Nvidia for making the impossible possible.

k.glimps · May 6, 2024, 9:55am

By Dask I meant Dask-CUDA (the dask_cuda package):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

import cudf
import dask_cudf
from numba import cuda
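
A minimal sketch of how the first two imports are typically used together, assuming dask-cuda and an Nvidia GPU are available. The helper name `start_gpu_cluster` is made up for this example, and the imports are placed inside the function so that the snippet can be loaded on a machine without a GPU.

```python
def start_gpu_cluster():
    """Start a local Dask-CUDA cluster and return (cluster, client)."""
    # Lazy imports: these only succeed where dask-cuda is installed.
    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client

    cluster = LocalCUDACluster()   # default settings: one worker per visible GPU
    client = Client(cluster)       # later Dask work is sent to this cluster
    return cluster, client


# Typical use on a GPU machine:
#   cluster, client = start_gpu_cluster()
#   ... run the Dask + Numba work ...
#   client.close()
#   cluster.close()
```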

k.glimps · May 6, 2024, 10:03pm

I am using the notebooks from the GTC talk
“Evaluating Your Options for Accelerated Numerical Computing in Pure Python”
by Matthew Penn, Senior Data Scientist.

Slides: https://static.rainfocus.com/nvidia/gtcspring2022/sess/1638480642908001OycX/SessionFile/Evaluating%20Your%20Options%20for%20Accelerated%20Numerical%20Computing%20in%20Pure%20Python_1647528023707001MkTJ.pdf

k.glimps · May 7, 2024, 8:29am

I have isolated the Numba part and eliminated the cuML part for later.
Then I played with his parameters.

The simple problem, 2^17 * 2^15:

1- Numba kernels: finishes in 240 ms, much longer than his 22 ms, but I think that is due to copy overhead.
2- Dask+Numba: finishes in 682 ms, but again copy overhead.

Then his big problem does not run within my limited 6 GB of GPU RAM (it needs recoding).

For the 2^24 by 2^16 case, Problem Size (N_OBS * N_REF): 1.1T

1- Numba alone:

%%time
out_idx_nb_cuda, out_dist_nb_cuda = numba_cuda_solve(d_obs, d_ref)

CPU times: user 24.8 s, sys: 0 ns, total: 24.8 s
Wall time: 24.9 s

2- Numba Dask:

/home/mabd/.local/lib/python3.10/site-packages/distributed/client.py:3162: UserWarning: Sending large graph of size 128.51 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
2024-05-06 21:40:06,338 - distributed.nanny - WARNING - Restarting worker
CPU times: user 1.36 s, sys: 677 ms, total: 2.04 s
Wall time: 26.9 s
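
The warning in the log suggests scattering the data ahead of time and working with futures. A minimal sketch of that pattern with the `dask.distributed` client methods `scatter`, `map`, and `submit` follows. It is not the notebook's code: `scattered_sum` is a made-up helper, `client` is assumed to be an already connected `Client`, and whether this removes the worker restart seen above is untested.

```python
def scattered_sum(client, chunks):
    """Sum a list of chunks on the cluster without embedding them in the graph.

    client: a connected dask.distributed.Client (assumed to exist already)
    chunks: a list of lists (or arrays) of numbers
    """
    # Ship each chunk to the workers once; only small handles come back.
    futures = client.scatter(chunks)
    # One small task per chunk, each referring to data already on a worker.
    partials = client.map(sum, futures)
    # Fuse the partial sums with a final task and fetch the result.
    return client.submit(sum, partials).result()
```

The point of the pattern is that the task graph then carries references to the data instead of the data itself, which is what the "large graph" message is complaining about.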

            *****===========================*****

Numba alone also worked for the (2^22)^2 case, Problem Size (N_OBS * N_REF): 17.6T. It finished with:

CPU times: user 7min 56s, sys: 46.6 ms, total: 7min 56s
Wall time: 7min 56s

Amazing: Numba eats a 17.6T float32 problem in less than 8 min. <<<<<<<<<

The Dask code is easier, and watching it work, it really maximizes the use of the RAM automatically, up to the whole 6 GB,
while in Numba you have to do it by hand. But both are OK.

I need to know why Dask crashes in order to avoid it.

Thank you for your code. That is what we are looking for from you.

k.glimps · May 8, 2024, 7:47am

Now, I have played a little bit with Dask. It crashes less, and I have tried Matthew’s code with different configs
(the exponents are written with the Python ** operator):

1-
N_OBS, N_REF = 2**24, 2**16
N_OBS_VAL, N_REF_VAL = 500, 200
print("Problem Size (N_OBS * N_REF): {:.2f}T".format(N_OBS * N_REF * 1e-12))
it gives
Problem Size (N_OBS * N_REF): 1.10T
2-
N_OBS, N_REF = 2**25, 2**16
N_OBS_VAL, N_REF_VAL = 500, 200
print("Problem Size (N_OBS * N_REF): {:.2f}T".format(N_OBS * N_REF * 1e-12))
it gives
Problem Size (N_OBS * N_REF): 2.20T
3-
N_OBS, N_REF = 2**26, 2**16
N_OBS_VAL, N_REF_VAL = 500, 200
print("Problem Size (N_OBS * N_REF): {:.2f}T".format(N_OBS * N_REF * 1e-12))
it gives
Problem Size (N_OBS * N_REF): 4.40T
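
In Python, `^` is bitwise XOR rather than exponentiation, so the sizes have to be spelled with `**`. A quick check of the difference for the first config (the variable names here are only for illustration):

```python
# Bitwise XOR: what Python computes if the exponent is typed with ^
xor_obs, xor_ref = 2 ^ 24, 2 ^ 16
print(xor_obs, xor_ref)      # 26 18

# Exponentiation: what the configs actually mean
pow_obs, pow_ref = 2 ** 24, 2 ** 16
print(pow_obs, pow_ref)      # 16777216 65536

# Same size report as in the configs
report = "Problem Size (N_OBS * N_REF): {:.2f}T".format(pow_obs * pow_ref * 1e-12)
print(report)                # Problem Size (N_OBS * N_REF): 1.10T
```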

I am using %%time.

The results are:
1-
/home/mabd/.local/lib/python3.10/site-packages/distributed/client.py:3162: UserWarning: Sending large graph of size 128.51 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
CPU times: user 1.48 s, sys: 552 ms, total: 2.03 s
Wall time: 21 s

2-
/home/mabd/.local/lib/python3.10/site-packages/distributed/client.py:3162: UserWarning: Sending large graph of size 256.51 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
CPU times: user 2.43 s, sys: 1.2 s, total: 3.64 s
Wall time: 42.8 s

3-
/home/mabd/.local/lib/python3.10/site-packages/distributed/client.py:3162: UserWarning: Sending large graph of size 512.51 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
CPU times: user 6.23 s, sys: 2.69 s, total: 8.93 s
Wall time: 1min 28s

Now, by simple linear extrapolation, I know why 2^27 is not working: it makes a big chunk of about 1G, and the RAM is already almost full.
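
The extrapolation can be written out from the three runs above. This is a rough sketch under the assumption that both the wall time and the graph size keep growing linearly with N_OBS * N_REF; the helper names are made up for the example.

```python
# (N_OBS * N_REF, wall time in seconds, graph size in MiB) for the three runs
runs = [
    (2**24 * 2**16, 21.0, 128.51),
    (2**25 * 2**16, 42.8, 256.51),
    (2**26 * 2**16, 88.0, 512.51),   # 1 min 28 s = 88 s
]


def extrapolate_time(samples, target_size):
    """Average the measured seconds per pair and scale to the target size."""
    rate = sum(seconds / size for size, seconds, _ in samples) / len(samples)
    return rate * target_size


def extrapolate_graph_mib(samples, target_size):
    """Straight line through the first and last measured graph sizes."""
    (size0, _, graph0), (size1, _, graph1) = samples[0], samples[-1]
    slope = (graph1 - graph0) / (size1 - size0)
    return graph0 + slope * (target_size - size0)


target = 2**27 * 2**16
est_seconds = extrapolate_time(runs, target)
est_graph = extrapolate_graph_mib(runs, target)
print("estimated wall time: {:.0f} s".format(est_seconds))
print("estimated graph size: {:.2f} MiB".format(est_graph))
```

Under these assumptions the graph for the 2^27 case comes out at roughly 1 GiB and the wall time at a little under 3 minutes.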

Anyhow, the 2^27 case should take around 3 min. Matthew's result is something like 15 sec.
But the point is not to beat Matthew's numbers, rather to show that the code works fine even
on a laptop GPU.
REMARKABLE: these results are much faster than any multicore CPU.
In Matthew's case, on a 64-core/128-thread CPU, he got 2 hrs 3 min 31 s.
Obviously, 3 min is faster than 2 hrs 3 min 31 s = 123 min 31 s.

The 1660 Ti (6 GB) is 40X faster than an AMD 7742 CPU with 64 cores (128 threads) and 512 GB of system memory.
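
The 40X figure is just the ratio of the two wall times, using the estimated 3 min for the GPU run. A small check (the helper name is made up for the example):

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/min/s wall time into seconds."""
    return hours * 3600 + minutes * 60 + seconds


cpu_seconds = to_seconds(hours=2, minutes=3, seconds=31)   # 64-core CPU run
gpu_seconds = to_seconds(minutes=3)                        # estimated 1660 Ti run

speedup = cpu_seconds / gpu_seconds
print("{:.1f}x".format(speedup))   # 41.2x
```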

Not to mention the poor guys who are still doing single-threaded programming or using Python alone!!!

Well, it remains to be seen how to import/combine this with the Sage/Cython ecosystem.

k.glimps · June 2, 2024, 11:52am

There is a hidden point: these are tera-sized arrays of 32-bit floating point.
But it is still remarkable to see a cheap 1660 Ti beat a 64-core/128-thread CPU by 40X.
My CPU is 8 times smaller, i.e. my GPU outperforms my 4800 CPU by a factor of 320x.
Wow. The question is: how far can one go with FP32?
To be seen.
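
One concrete limit of FP32 can be checked without a GPU. A 32-bit float has a 24-bit significand, so above 2^24 not every integer is representable, and a long running sum stops seeing small increments. The sketch below emulates FP32 with the standard `struct` module; `to_f32` and `f32_sum` are helper names made up for the example.

```python
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_sum(values):
    """Accumulate in FP32 by rounding after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


# 2^24 + 1 cannot be stored in FP32; it rounds back to 2^24.
print(to_f32(2.0**24 + 1.0))          # 16777216.0

# Adding 1 to an FP32 accumulator that already holds 2^24 changes nothing.
print(f32_sum([2.0**24, 1.0]))        # 16777216.0

# Rounding drift of a long FP32 sum compared with the FP64 sum.
values = [0.1] * 100_000
print(f32_sum(values), sum(values))
```

For reductions over very large arrays, this is the kind of effect that decides whether FP32 is enough or whether the accumulation has to be done in FP64 or in smaller partial sums.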