Quick question about kernel launch overhead and algorithm design...

Hello,

I have some mesh code that I want to parallelize along the lines of this paper: http://www.comp.nus.edu.sg/~tants/gdel3d_files/AshwinNanjappaThesis.pdf

Basically, we have a triangular mesh and a set of points. For each triangle we find a point inside it and insert it, which fractures that triangle and creates more triangles. We repeat this until there are no more points to insert.

Now, I'm pretty sure this cannot be done in one single CUDA kernel, and even if it could, it would require a lot of locking, which is needless complexity.

My main idea is instead to break the problem up into small kernels, with the host calling cudaDeviceSynchronize() to ensure each kernel has finished before the next one starts.
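For concreteness, here is a rough sketch of the host-side structure I have in mind (the kernel names, launch sizes, and the "remaining points" counter are all made up for illustration, not my real code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernels: each one is a small, self-contained pass over the mesh.
__global__ void locate_points(/* points, triangles, ... */)        { /* ... */ }
__global__ void split_triangles(/* triangles, chosen points, ... */) { /* ... */ }

int main() {
    int  h_remaining = 1;          // points still waiting to be inserted
    int *d_remaining = nullptr;
    cudaMalloc(&d_remaining, sizeof(int));
    cudaMemset(d_remaining, 0, sizeof(int));   // real code would update this on the device

    while (h_remaining > 0) {
        // One insertion round, split across several small kernels.
        locate_points<<<256, 256>>>();
        split_triangles<<<256, 256>>>();

        // Make sure the whole round has finished before the host decides
        // whether another round is needed.
        cudaDeviceSynchronize();
        cudaMemcpy(&h_remaining, d_remaining, sizeof(int),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_remaining);
    printf("all points inserted\n");
    return 0;
}
```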

My biggest concern is the overhead of launching all these kernels.

Whether each kernel does relatively simple or relatively complex work, should I be concerned about the launches themselves? Or should I try to cram everything into as few kernels as possible? I will try to pack as much work as I can into each one, but there is a point where I NEED a kernel to finish completely before I can begin processing its results.

In case you're wondering what I'm really trying to do: I'm going to take the algorithm from the paper I posted, and instead of perturbing points into general position (so that no point lies exactly on a triangle edge), I want the code to handle the case where a point does land on an edge.
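Very roughly, I mean a classification step like the one below (2D, plain doubles, names made up; a robust version would need exact or adaptive predicates rather than a naive == 0 test):

```cpp
struct Point { double x, y; };

// Signed area of (a, b, c): > 0 left turn, < 0 right turn, == 0 collinear.
__host__ __device__ double orient2d(Point a, Point b, Point c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

enum Location { OUTSIDE, INSIDE, ON_EDGE };

// Classify point p against a counter-clockwise triangle (a, b, c).
__host__ __device__ Location locate(Point p, Point a, Point b, Point c) {
    double d0 = orient2d(a, b, p);
    double d1 = orient2d(b, c, p);
    double d2 = orient2d(c, a, p);
    if (d0 < 0.0 || d1 < 0.0 || d2 < 0.0) return OUTSIDE;   // beyond some edge
    if (d0 == 0.0 || d1 == 0.0 || d2 == 0.0) return ON_EDGE; // exactly on an edge
    return INSIDE;                                            // strictly inside
}
```

The idea would be that an INSIDE point does the usual 1-to-3 split, while an ON_EDGE point splits the edge, so the triangle and its neighbour across that edge each get cut in two instead.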

If you mean that you are avoiding launching more thread blocks (per kernel) than the GPU device has 'capacity' for, then you should not worry about that at all.

Kernel launch overhead seems to be minimal anyway; what matters more is the correct execution of memory loads/stores and the design of the algorithm.

I recently tested whether it was better to launch exactly as many threads (in blocks) as the GPU has capacity for, with each thread block staying active until all the work was done.

It turns out it is far better to over-subscribe. Here is the link to my test; pay attention to njuffa's response:

64-bit scan/reduction for min element (double). Two example implementations, but unexpected timing - CUDA Programming and Performance - NVIDIA Developer Forums
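Roughly speaking, the two styles being compared look like the trivial kernels below (my own illustration, not the actual benchmark from that thread): launching one thread per element and over-subscribing, versus launching only as many blocks as stay resident and having each thread walk the array with a grid-stride loop.

```cpp
__global__ void oversubscribed(const float *in, float *out, int n) {
    // One element per thread; launch roughly n / blockDim.x blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

__global__ void persistent_style(const float *in, float *out, int n) {
    // Launch only as many blocks as the device can keep resident; each
    // thread strides through the array until all the work is done.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        out[i] = 2.0f * in[i];
    }
}
```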

Maybe you meant something else, but in general kernel launch overhead should not be a major concern on recent GPUs. Or at least that has been my experience.

Oh thank God.

It's just that this simplifies my algorithm design by orders of magnitude.

I was also imagining that it wouldn't be as costly as creating and launching a thread on the CPU, because, I mean, what else does a GPU do besides just sit there and wait for work? I could be wrong. I'm also 3 margaritas in, so if I said something wrong, it's the smooth taste of agave talking XD