I am developing a ray-caster for algebraic surfaces in CUDA, where each thread corresponds to one ray. In short, every thread has to calculate the coefficients of its univariate polynomial along the ray, and numerical root finding is then used to locate the intersection point(s). I can calculate the coefficients in several different ways; what they have in common is that every thread uses exactly the same number of texture fetches and arithmetic operations.

However, it seems to be significantly faster to split the work into two passes than to do everything in a single kernel: in the first pass each thread calculates its coefficients and stores them to global memory, and in a second pass each thread starts by loading its coefficients from global memory and then does the root finding.
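To make the two variants concrete, here is a minimal sketch of what I mean. It is not my actual code: `computeCoefficients` is a hypothetical stand-in for the real coefficient calculation (no texture fetches here), the polynomial is assumed to be a cubic, and the root finder is a plain fixed-iteration Newton step chosen only for illustration. All function and kernel names are made up for this example.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int DEGREE = 3;            // assumed polynomial degree
constexpr int NCOEF  = DEGREE + 1;

// Hypothetical stand-in for the real per-ray coefficient calculation.
// Builds p(t) = (t - r)(t^2 + 1) = t^3 - r t^2 + t - r, whose only real root is r.
__device__ void computeCoefficients(int ray, float c[NCOEF]) {
    float r = 1.0f + 0.001f * ray;
    c[0] = -r; c[1] = 1.0f; c[2] = -r; c[3] = 1.0f;
}

// Illustrative root finder: fixed number of Newton steps starting at t = 0,
// evaluating p and p' together with Horner's scheme.
__device__ float findRoot(const float c[NCOEF]) {
    float t = 0.0f;
    for (int it = 0; it < 16; ++it) {
        float p = c[DEGREE], dp = 0.0f;
        for (int k = DEGREE - 1; k >= 0; --k) {
            dp = dp * t + p;
            p  = p * t + c[k];
        }
        t -= p / dp;
    }
    return t;
}

// Variant A: one kernel, coefficients never leave the thread.
__global__ void fusedKernel(float* roots, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c[NCOEF];
    computeCoefficients(i, c);
    roots[i] = findRoot(c);
}

// Variant B, pass 1: compute coefficients and store them to global memory.
// Layout coef[k * n + i] keeps neighbouring threads on neighbouring addresses.
__global__ void coefficientKernel(float* coef, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c[NCOEF];
    computeCoefficients(i, c);
    for (int k = 0; k < NCOEF; ++k) coef[k * n + i] = c[k];
}

// Variant B, pass 2: load coefficients back and find the root.
__global__ void rootKernel(const float* coef, float* roots, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c[NCOEF];
    for (int k = 0; k < NCOEF; ++k) c[k] = coef[k * n + i];
    roots[i] = findRoot(c);
}

std::vector<float> runFused(int n) {
    float* dRoots = nullptr;
    cudaMalloc(&dRoots, n * sizeof(float));
    fusedKernel<<<(n + 127) / 128, 128>>>(dRoots, n);
    std::vector<float> h(n);
    cudaMemcpy(h.data(), dRoots, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dRoots);
    return h;
}

std::vector<float> runTwoPass(int n) {
    float *dCoef = nullptr, *dRoots = nullptr;
    cudaMalloc(&dCoef, NCOEF * n * sizeof(float));
    cudaMalloc(&dRoots, n * sizeof(float));
    coefficientKernel<<<(n + 127) / 128, 128>>>(dCoef, n);
    rootKernel<<<(n + 127) / 128, 128>>>(dCoef, dRoots, n);
    std::vector<float> h(n);
    cudaMemcpy(h.data(), dRoots, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCoef);
    cudaFree(dRoots);
    return h;
}

int main() {
    const int n = 1 << 16;
    std::vector<float> a = runFused(n), b = runTwoPass(n);
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i) maxDiff = fmaxf(maxDiff, fabsf(a[i] - b[i]));
    printf("max |fused - two-pass| = %g\n", maxDiff);
    return 0;
}
```

Both variants do the same arithmetic per thread; the only difference is the round trip through global memory in variant B. Building with `nvcc -Xptxas -v` prints the register and local-memory usage of each kernel, which is one thing that could be compared between the two variants.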

Since the coefficient calculation is similar for every thread, I would have expected that splitting the work into two passes would make it slower, not faster, because of the extra global memory traffic. Can anyone share some insight on this?