One large or several small kernels

I am developing a ray caster for algebraic surfaces in CUDA, where each thread corresponds to one ray. In short, every thread has to calculate the coefficients of the univariate polynomial along its ray; numerical root finding is then used to locate the intersection point(s). I can calculate the coefficients in several different ways; what they all have in common is that every thread uses exactly the same number of texture fetches and arithmetic operations.

However, it seems to be significantly faster to let each thread calculate its coefficients and store them to global memory in a first pass, and then issue a second pass for the root finding, where each thread starts by loading its coefficients back from global memory.
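For concreteness, the two-pass structure described above might look like the following sketch. The kernel names, the coefficient buffer `d_coeffs`, and the fixed degree are hypothetical placeholders, not taken from the original code:

```cuda
#define DEGREE 4  // e.g. a quartic surface: DEGREE + 1 coefficients per ray

// Pass 1: each thread computes the coefficients of its ray's
// univariate polynomial and writes them to global memory.
__global__ void computeCoefficients(float *d_coeffs, int numRays)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    float c[DEGREE + 1];
    // ... texture fetches and arithmetic to fill c[] ...

    // Store coefficients grouped by coefficient index, so that
    // consecutive threads write consecutive addresses (coalesced).
    for (int i = 0; i <= DEGREE; ++i)
        d_coeffs[i * numRays + ray] = c[i];
}

// Pass 2: each thread reloads its coefficients and does root finding.
__global__ void findRoots(const float *d_coeffs, float *d_hits, int numRays)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    float c[DEGREE + 1];
    for (int i = 0; i <= DEGREE; ++i)
        c[i] = d_coeffs[i * numRays + ray];

    // ... numerical root finding on c[], write nearest hit to d_hits ...
}
```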

Since the coefficient calculation is identical for every thread, I would have expected splitting the work into two passes to make it slower, not faster. Would anyone like to share some insight on this?

Did you investigate the resource usage of the two variants? Shared memory usage, shared memory bank conflicts, register count, coalesced device memory access, and the number of warps per block all influence speed. Since the two-pass version is faster, I suspect the single-pass version uses so many registers that the multiprocessors cannot hide memory access latency by running several blocks at once. Check this with the occupancy calculator (it has a how-to inside).
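Besides the occupancy calculator spreadsheet, the suspicion can be checked programmatically with the runtime occupancy API (available in CUDA 6.5 and later; `singlePassKernel` and the launch configuration below are assumptions, not from the original post). The register and shared memory counts the spreadsheet needs can be obtained from `nvcc --ptxas-options=-v`:

```cuda
#include <cstdio>

__global__ void singlePassKernel(/* ... */) { /* ... */ }

int main()
{
    int blocksPerSM = 0;
    int threadsPerBlock = 128;  // assumed launch configuration

    // Asks the runtime how many blocks of this kernel fit on one
    // multiprocessor, given its register and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, singlePassKernel, threadsPerBlock,
        0 /* dynamic shared memory per block */);

    printf("active blocks per SM: %d\n", blocksPerSM);
    // A result of 1 means the multiprocessor cannot interleave
    // blocks to hide memory latency, which would be consistent
    // with the two-pass version winning.
    return 0;
}
```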


Thanks, this seems to be the explanation. My single-pass kernel uses a lot of shared memory per block.