The second method is generally preferred, for several reasons:
- Every kernel launch has overhead (a fixed time cost), paid whether the kernel does much work or not. Fewer kernel launches means less total overhead.
- You generally want to do as much work per kernel launch as possible. This reinforces the other two points, and if doing more work means you can launch more threads, you have a better opportunity to saturate the machine and give the GPU the best chance at latency hiding.
- Kernel fusion: by combining operations that work on the same data into a single kernel, we may be able to reduce loads from and stores to global memory, which can have a significant impact on a memory-bound code.
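To illustrate the fusion point, here is a minimal sketch (the kernel names and the y = a*x + b, z = y*y computation are illustrative, not from any particular code). The unfused version writes the intermediate array to global memory and reads it back; the fused version keeps it in a register:

```cuda
// Unfused: two launches. The intermediate y makes a round trip
// through global memory (one store in axpb, one load in square).
__global__ void axpb(const float *x, float *y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + b;
}

__global__ void square(const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] * y[i];
}

// Fused: one launch. x is read once, z is written once, and the
// intermediate value never touches global memory.
__global__ void axpb_square(const float *x, float *z, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i] + b;   // lives in a register
        z[i] = t * t;
    }
}
```

For a memory-bound code like this, the fused version cuts global memory traffic from roughly 4n to 2n float transfers, in addition to saving one launch.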
The only thing I can think of to commend the first method is the case where your problem size is so small that you cannot saturate the GPU with a single kernel launch. Even then, my preference would be to seek to expose more parallelism.
In any event, it should be easy enough to cook up an example of both cases, benchmark them, and draw your own conclusions. I generally advise against taking work that can be done in a single kernel and breaking it into two or more kernels. I'm sure exceptions and corner cases can be found.
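A benchmark of the two approaches might use CUDA events for timing, along these lines (a skeleton with assumed device pointers `d_x`, `d_y`, `d_z` and launch configuration; error checking omitted for brevity):

```cuda
// Time each variant with CUDA events; swap in the two-kernel
// sequence or the single fused kernel between start and stop.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// Variant A: two launches
// axpb<<<grid, block>>>(d_x, d_y, a, b, n);
// square<<<grid, block>>>(d_y, d_z, n);
// Variant B: one fused launch
// axpb_square<<<grid, block>>>(d_x, d_z, a, b, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("elapsed: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Run each variant in a loop after a warm-up iteration, since the first launch includes one-time initialization costs that would skew a single measurement.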