Hi, I am doing scientific computations.
Higher-genus Riemann surfaces = multi-soliton solutions of non-linear systems of PDEs.
I use higher-dimensional zetas with quaternions/octonions and even beyond.
nvcc++ is great, and with four and eight dimensions it works better than on 2D problems: better loading and distribution, better optimization.
That is really good. TOO COOL to be true.
Now, I tried offloading to GPU + CPU at the same time.
It is not allowed by default, but a workaround is easy: you build your work
into two separate dynamic (.so) libraries, then, using C (gcc), just fork and join.
It works nicely. How is it possible that you haven't allowed this? It gives a significant gain, at least double the performance over the initial gain.
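A minimal sketch of the fork-and-join pattern in C, to show what I mean. The names gpu_part, cpu_part and fork_join are hypothetical, and the two work functions here are plain placeholder loops; in my setup they would be looked up from the two separately built .so libraries (for example with dlopen/dlsym). The child process runs one half and sends its result back through a pipe, while the parent runs the other half and then joins.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef double (*work_fn)(void);

/* Hypothetical stand-in for the GPU-offloaded half of the work. */
double gpu_part(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000; ++i)
        s += 1.0 / ((double)i * (double)i);
    return s;
}

/* Hypothetical stand-in for the CPU half of the work. */
double cpu_part(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000; ++i)
        s += 1.0 / (double)i;
    return s;
}

/* Fork: the child runs child_work and writes its result to a pipe.
 * The parent runs parent_work concurrently, then reads the child's
 * result and joins with waitpid. Returns 0 on success, -1 on failure. */
int fork_join(work_fn child_work, work_fn parent_work,
              double *child_out, double *parent_out)
{
    assert(child_out != NULL && parent_out != NULL);

    int fd[2];
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                       /* child process */
        close(fd[0]);
        double r = child_work();
        ssize_t n = write(fd[1], &r, sizeof r);
        close(fd[1]);
        _exit(n == (ssize_t)sizeof r ? 0 : 1);
    }

    close(fd[1]);                         /* parent process */
    *parent_out = parent_work();          /* overlaps with the child */
    ssize_t n = read(fd[0], child_out, sizeof *child_out);
    close(fd[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)     /* join */
        return -1;
    if (n != (ssize_t)sizeof *child_out)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling fork_join(gpu_part, cpu_part, &a, &b) fills a with the child's result and b with the parent's. Separate processes are used here rather than threads only because that matches the workaround I described.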
Now, the technical problem. A genius guy at NVIDIA has done a really nice job on the GPU side.
Your GPU implementation of fp64 (floating point) is different from the CPU one.
Your CPU implementation is based on LLVM, which is bad for fp64: it is not just less accurate than gcc, it is worse, it gives wrong numbers.
If you are interested, I can give explicit examples (but excuse the math).
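To make such examples easy to compare, here is a small C sketch of how two fp64 results can be checked bit for bit. The helper names fp64_bits, ulp_distance and report_fp64 are hypothetical, and the values passed in would be the CPU and GPU results of the same computation. This only measures how far apart two doubles are; it says nothing about which one is right.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Raw IEEE 754 bit pattern of a double (memcpy avoids aliasing issues). */
uint64_t fp64_bits(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Map a double to an integer key that increases with the value,
 * so that +0.0 and -0.0 get the same key. Not meaningful for NaN. */
static uint64_t ordered_key(double x)
{
    const uint64_t sign = 0x8000000000000000ULL;
    uint64_t u = fp64_bits(x);
    uint64_t mag = u & ~sign;
    return (u & sign) ? sign - mag : sign + mag;
}

/* Number of representable doubles between a and b (0 if identical). */
uint64_t ulp_distance(double a, double b)
{
    uint64_t ka = ordered_key(a);
    uint64_t kb = ordered_key(b);
    return ka > kb ? ka - kb : kb - ka;
}

/* Print both results in decimal, hex-float and raw-bit form,
 * followed by their distance in ULPs. */
void report_fp64(const char *label, double cpu, double gpu)
{
    printf("%s cpu: %.17g  %a  0x%016" PRIx64 "\n", label, cpu, cpu, fp64_bits(cpu));
    printf("%s gpu: %.17g  %a  0x%016" PRIx64 "\n", label, gpu, gpu, fp64_bits(gpu));
    printf("%s ulp distance: %" PRIu64 "\n", label, ulp_distance(cpu, gpu));
}
```

Printing with %a (hex float) avoids any rounding from decimal formatting, so two numbers that print the same are the same double.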
It would be nice if you made your CPU fp64 uniform with the GPU, i.e. changed the default
LLVM. I have tested LLVM extensively through Numba; beyond that, it crashes too much
with huge arrays and has many precision problems. LLVM and Numba are not
good for HPC at all. They sXs bad: they give wrong answers, not just precision and
crash problems.
PLEASE MAKE OUR LIFE EASY if you are really into highly scientific computations.
UNIFORMIZE your fp64 in favor of your GPU implementation. IT IS GREAT,
even better than gcc. THAT IS REALLY an achievement; you should be proud of it and build upon it.
OK. If you need more details, I can give concrete examples. I have been working with “higher” solitons for the last 20+ years.
Thank you.
Hoping to hear good news.