Hello,

I have some mesh code that I want to parallelize along the lines of this paper: http://www.comp.nus.edu.sg/~tants/gdel3d_files/AshwinNanjappaThesis.pdf

Basically, we have a triangular mesh and a set of points. For every triangle, we find a point that lies inside it and insert it, which fractures that triangle into more triangles. We repeat this until there are no more points to be inserted.

Now, I’m pretty sure that this cannot be done in a single CUDA kernel, and even if it could, it would require a lot of locking mechanisms, which is needless complexity.

My main idea is to instead break the problem up into small kernels, with the host calling cudaDeviceSynchronize() to ensure that each kernel has finished before the next one starts.
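To make the plan concrete, here is a minimal sketch of the host loop I have in mind. It is a toy, not the algorithm from the paper: insert_round, run_rounds, and the per-triangle pending counters are hypothetical names, and the kernel only decrements a counter where the real code would fracture a triangle. It assumes n > 0 and omits error checking.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one insertion round. pending[i] is the number of
// points still waiting to be inserted into triangle i. Each thread "inserts"
// one point (here: just decrements the counter) and raises *more if its
// triangle still has points left for a later round.
__global__ void insert_round(int *pending, int n, int *more)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && pending[i] > 0) {
        pending[i] -= 1;
        if (pending[i] > 0) {
            *more = 1;  // every writer stores the same value
        }
    }
}

// Host loop: one kernel launch per round, with a full synchronization between
// rounds, repeated until no triangle has points left. Returns the number of
// rounds launched. Assumes n > 0.
int run_rounds(const int *host_pending, int n)
{
    int *d_pending = nullptr;
    int *d_more = nullptr;
    cudaMalloc(&d_pending, n * sizeof(int));
    cudaMalloc(&d_more, sizeof(int));
    cudaMemcpy(d_pending, host_pending, n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int rounds = 0;
    int more = 1;
    while (more) {
        cudaMemset(d_more, 0, sizeof(int));           // reset the flag
        insert_round<<<blocks, threads>>>(d_pending, n, d_more);
        cudaDeviceSynchronize();                      // wait for this round
        cudaMemcpy(&more, d_more, sizeof(int), cudaMemcpyDeviceToHost);
        ++rounds;
    }

    cudaFree(d_pending);
    cudaFree(d_more);
    return rounds;
}
```

With pending counts of {3, 1, 2}, this loop launches three rounds, so the number of launches is driven by the triangle with the most points left to insert.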

My biggest concern is the overhead of launching all these kernels.

Whether each kernel does a lot of work or only a little, should I be concerned about the cost of the kernel launches themselves? Or should I try to cram everything into as few kernels as possible? I will try to make each kernel do as much as it can, but at some point I NEED a kernel to finish completely before I can begin processing the data it produced.
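One way I could put a rough number on the launch cost is to time an empty kernel with CUDA events, synchronizing after every launch the way my plan would. This is only a sketch: empty_kernel and average_launch_us are made-up names, there is no error checking, and the result will depend on the GPU, the driver, and the OS.

```cuda
#include <cuda_runtime.h>

// A kernel that does nothing, so the measured time is dominated by the
// launch and the synchronization rather than by any real work.
__global__ void empty_kernel() {}

// Returns the average time in microseconds for one "launch + synchronize"
// pair, measured over `launches` repetitions. Assumes launches > 0.
float average_launch_us(int launches)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not included in the timing.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();   // same pattern as the multi-kernel plan
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // make sure the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f / launches;
}
```

Comparing a call such as average_launch_us(1000) against the running time of a real insertion kernel should indicate whether the launches are a meaningful fraction of each round.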

In case you were wondering what I’m really trying to do: I’m going to take the algorithm from the paper I linked and, instead of perturbing the points into general position (so that no point lies on the edge of a triangle), I want the code to be able to handle cases where a point does lie on an edge.