Performance: many small kernels vs. one big one? Which is better?

I have a general performance question. I’m working on a fluid solver. In many of the solver’s steps (e.g. advection), a number of variables are updated based on their previous values and on the values of some common variables (e.g. the velocity field). Typically, updating each variable requires some unique texture lookups, but also many lookups that are shared with the updates of other variables.

In such a scenario, is it better to update each variable in its own kernel (and thus have many redundant texture lookups) or is it better to combine everything into one big kernel (and risk additional cache misses)?
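For concreteness, here is a minimal sketch of the two options (all kernel and field names are invented, and the update formula is just a placeholder, not real advection):

```cuda
// Hypothetical example: two separate kernels vs. one fused kernel.
// Both separate kernels re-read the shared velocity field; the fused
// version performs that lookup only once per cell.

__global__ void advectDensity(float *dens, const float *densPrev,
                              const float *vel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = vel[i];            // shared lookup, repeated per kernel
    dens[i] = densPrev[i] - v;   // placeholder for the real update
}

__global__ void advectTemperature(float *temp, const float *tempPrev,
                                  const float *vel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = vel[i];            // same lookup done again
    temp[i] = tempPrev[i] - v;
}

// Fused alternative: one pass, velocity fetched once per cell and
// shared by both updates, at the cost of a bigger kernel.
__global__ void advectBoth(float *dens, const float *densPrev,
                           float *temp, const float *tempPrev,
                           const float *vel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = vel[i];            // fetched once
    dens[i] = densPrev[i] - v;
    temp[i] = tempPrev[i] - v;
}
```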

I’m not sure my problem has the same properties as yours, but in my case one long call to an “aggregate” kernel was generally much, much faster than successive calls to a kernel that did one step per call.

I agree with Stickguy, from my experience. The most time-consuming part of a CUDA kernel is loading data from global memory, because global memory is not cached. Separate kernels may have to do redundant global memory accesses.

I did a small test where I performed the advection with two kernels and then with one kernel. The one-kernel version was about 25% faster than the two-kernel version.

EDIT: However, after playing around with the block and grid sizes, the one-kernel version is now twice as fast as the original, and the two-kernel version is 25% faster than the faster one-kernel version. Strange.

I think the biggest problem here is kernel launch overhead; if you make a lot of really small kernel calls, that will start to become a serious performance bottleneck.

You would have to be making quite a few kernel calls for this to matter. Memcpy is the giant issue here, but since the data should stay resident on the device, you should be fine; the problem arises when you start memcpy’ing back and forth between the CPU and GPU. If you are calling, say, 3-5 kernels instead of one, you can gain an advantage from the multiple kernels if and only if each kernel has separate thread/block requirements for optimization. If every kernel uses roughly the same thread/block dimensions, then there isn’t a real benefit from multiple calls (that I can see).
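A sketch of what that launch sequence might look like (kernel and variable names are hypothetical); there is no host/device memcpy between the calls, and each kernel can be given its own block size:

```cuda
// Hypothetical: the fields stay resident on the device, so the only
// extra cost of splitting is launch overhead. Each kernel can use
// the block size that suits it best.
advectDensity<<<(n + 255) / 256, 256>>>(dens, densPrev, vel, n);
advectTemperature<<<(n + 127) / 128, 128>>>(temp, tempPrev, vel, n);
cudaDeviceSynchronize();   // only needed before timing or readback
```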

Another issue I noticed with the larger kernels is that they tended to run out of registers and spill some variables to local memory. This has a pretty big negative impact on performance.
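One way to see and control this (a sketch, not from the poster above): compile with `-Xptxas -v` so ptxas reports per-thread register and local-memory (spill) usage, and optionally bound register use with `__launch_bounds__`, keeping in mind that capping registers too aggressively is exactly what forces the spills:

```cuda
// Compile with: nvcc -Xptxas -v kernel.cu
// ptxas then prints registers used and any local-memory spill bytes.

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells the
// compiler how many registers it may budget per thread; too tight a
// bound pushes variables into slow local memory.
__global__ void __launch_bounds__(256, 2)
bigFusedKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body
}
```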