Feedback on a paper I wrote about 3d triangulations!

Hey guys,

This forum has honestly helped me a lot in the past so I’m hoping I could maybe get some feedback on this quick and dirty paper I wrote about meshing.

Let me know if anything doesn’t make any sense or needs more clarification.

https://github.com/LeonineKing1199/Regulus-1.8/blob/master/regulus-paper.pdf

Thanks for the link and the source code.
All to often authors of GPGPU papers only speak about the ‘high-level’ details and gloss over implementation specifics.

What about benchmark comparisons of your GPU implementation vs a decent CPU implementation? You could have a comparison chart along the lines of this older project of mine(as an example);

https://github.com/OlegKonings/CUDA_brute_triangle

Keep us updated.

So I got some feedback from another forum about my point redistribution scheme.

As of now it takes a solid 3 kernels to make it happen. 2 of those are literally just getting the write indices.

I know that there’s a range of potential valid values to write to. What if I just created a random number and prayed that there were no collisions?

“What if I just created a random number and prayed that there were no collisions?”

i am told that hope is not a strategy
there is some truth in that; so, i suppose the best is to have hope and a strategy

i understand why it is dragged/ thrown into the conversation, but i do not see why the number of kernels is overly important
number of kernels is a (poor) proxy for redundancy; redundancy is of far greater importance
if you have implemented your desired functionality, and it is ‘minimal’/ ‘economical’, then the number of kernels is less of an issue
there are many ways to skin a cat; the question is whether you have used the cheapest one
unfortunately, the profiler can not profile redundancy for you, although its measures help in this regard
you need to study the code, the amount of instructions to implement and achieve a functional output, and contemplate whether it can be made cheaper - not only by considering hardware-alignment, but also (rather) algorithm design
this may imply a greater focus on the algorithm itself, rather than its (hardware-oriented) implementation, and may imply exploiting features/ characteristics within the data, and/ or pertaining to the problem more

This is outside my area of expertise, but wouldn’t random selection in this context run into the birthday paradox, making collisions a lot more likely than one might naively assume? Algorithms that give rise to collision (e.g. ordinary hashes) do need a strategy on how to handle those when they occur.

Interesting. Redundancy would be the biggest issue.

One thing is, when I fracture, I write everything back to the mesh buffer. This is cool. But then when I redistribute, I have to load in those tetrahedra again.

I wish I could redistribute from the fracture kernel. But this is problematic. It’s not like I can’t get the write indices before the fracture kernel for the redistribution routines but I’m not sure how to do the threading model.

We have one thread per point to insert but then there could be any number of points residing in a fractured tetrahedra. Dynamic parallelism would be my go-to but the last time I used it, I got slapped in the face pretty hard by it (i.e. it didn’t work well).

Personally I find that some out-of-the-box thinking coupled with experiments often leads to superior solutions compared to what is described in literature as tried-and-true methods. A lot of “conventional wisdom” in computer science and engineering tends to be overly generalized or suboptimal since out of tune with changed circumstances such as the introduction of massively parallel processors that provide ridiculously cheap FLOPs. Also methods that are considered insufficiently stable for the general case may prove very useful in particular circumstances, e.g. anything involving matrix inverse or pseudo-inverse, forward recurrences.

My preferred approach is to first try to tackle a problem from an almost blank slate to avoid boxing in my mind with results from existing literature, getting a feel for the subject matter through experiments. Once I get stuck trying to tackle the issue on my own, I then consult as much of the literature as I can, including obscure or very old papers. Some of those previously discarded ideas may be applicable once again (e.g. dealing with very expensive double-precision computation) or may make sense with today’s architectures where they didn’t make much sense with yesterday’s architectures (anything involving redundant computation).

“but then there could be any number of points residing in a fractured tetrahedra”

interestingly, mathematicians thought it useful to devise and distinguish between countable infinite and uncountable infinite
i have learned that it pays to rather not label a variable or factor as pure random or pure deterministic, but as something in the middle
in many cases, random factors/ variables are deterministically random
it means you can devise regions or boundaries, and work with these; already you are no longer facing a purely random variable

“Dynamic parallelism would be my go-to”

if you are thinking of using dp (solely) for purposes of scaling - thread blocks can scale too, perhaps less graciously, with loops and conditionals
a thread block can handle both more and less ‘points’/ threads than the number of resident thread block threads, through said methods

Hmm…

That’s some good advice, njuffa and jimmy. I’ll have to think about this. Man, this is tough though!

Meshing is hard T_T

Also jimmy, what do you mean by thread blocks? Because I was thinking that I could use one thread block per insertion point, have the first thread of each block actually make the fractures (writing the tetrahedron to the stack) and then from there the rest of the block could work on distributing the points. This doesn’t feel like it exploits a lot of parallelism though. Maybe I could use DP to extend the number of blocks?

“Also jimmy, what do you mean by thread blocks?”

i am working from what you have posted thus far - the little i know about your project/ problem and its dynamics
the point was that, if you use your thread block(s) more dynamically, you may find that you can substitute dp to some extent, and actually do not have to turn to dp to “extend the number of blocks”
among other reasons, one may wish to turn to dp to gear up/ down the number of threads executing functionality; however, dp is not the only way this can be done, i would think

“Because I was thinking that I could use one thread block per insertion point, have the first thread of each block actually make the fractures (writing the tetrahedron to the stack) and then from there the rest of the block could work on distributing the points. This doesn’t feel like it exploits a lot of parallelism though.”

a very ‘fat’ paragraph with a lot to unpack, perhaps
the cost of fracturing, relative to the cost of distributing would perhaps determine whether the current arrangement is acceptable or too costly
if it is rather quick to fracture, compared to distribution, it would not waste/ cost too much if a single thread alone fractures
(the warps of) a device briefly emulating and acting as an intel phi, with multiple single-threaded warps executing, may be both optimal and sub-optimal - one could argue that the warps are poorly utilized; on the other hand, that it may still be better than utilizing the host for the particular task (a 15 sm device that runs only 1 thread per warp still yields 15 x 64 theoretical threads as ‘cores’)…

the other alternative is a thread block - and thread blocks - fracturing multiple cases in unison, with the same, or other, thread blocks then distributing

you could also gear your fracturing::distribution, depending on cost and execution time
assign (independent) fracturing/ distribution blocks depending on the time it takes to fracture/ distribute

i develop the feeling that you may have dependencies as well
any effort to circumvent or negate these should result in better parallelism too
perhaps different blocks/ kernels can independently and concurrently work in different regions, if fracturing and distribution stacks are location-dependent