There’s no efficiency difference between using a 1D and a 2D arrangement of threads; it’s mostly a convenience. So the remaining question is whether to fire off 400 threads in one block, or to fire off 32*13 = 416 threads and then use a test so that only the first 400 do your work.
The answer is that it really doesn’t matter MUCH: if you run them you’ll get the same result, and there won’t be any noticeable runtime difference.
In the first case, the hardware schedules each block in warps of 32 threads at a time, and in the last warp half the threads (16 of the 32) are simply disabled/skipped.
In the second case, the exact same thing happens, except your own explicit test in the code is what discards the last 16 unneeded threads.
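A minimal sketch of the two styles, assuming a single block of a hypothetical kernel that doubles 400 floats (the kernel names and the device pointer d_data are just illustrative):

```cuda
// Method #1: launch exactly 400 threads; the hardware pads the
// last warp internally (16 of its 32 lanes are simply disabled).
__global__ void scale1(float *data)
{
    int i = threadIdx.x;   // 0..399 in a single block
    data[i] *= 2.0f;
}

// Method #2: launch a full multiple of the warp size (13*32 = 416)
// and discard the extra threads with an explicit test.
__global__ void scale2(float *data, int n)
{
    int i = threadIdx.x;   // 0..415 in a single block
    if (i < n)             // only the first 400 threads do work
        data[i] *= 2.0f;
}

// Host-side launches (d_data assumed already allocated on the device):
//   scale1<<<1, 400>>>(d_data);
//   scale2<<<1, 13 * 32>>>(d_data, 400);
```

Both versions occupy the same 13 warps per block on current hardware; the only difference is who does the masking, you or the scheduler.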
But the first method is still preferred for three reasons. First, your code is shorter and cleaner.
The second reason is potential future flexibility and efficiency. If later hardware allowed finer-grained warps of, say, 8 threads, method #1 would schedule ceil(400/8) = 50 warps, while the manual method #2 would still launch 416/8 = 52 warps, wasting scheduling effort on two extra warps whose threads are immediately killed by your manual thread ID test.
The final, and most important, reason to use method #1: the explicit manual test is harder to maintain and understand, because you now have both a launched thread count and an “effective” size. You have to keep them in sync with each other and reason about their relationship. It’s unneeded complexity.
It’s actually a good question, though: answering it forces you to think a little about how the kernel threads are scheduled.
So the preferred method is #1: let the CUDA driver and hardware decide the scheduling, and don’t try to tweak it manually. It won’t make much difference in practice, but that’s all the more reason not to muck with the threads by hand.