Your example still launches a total thread count that is a multiple of 32. Your block dimension is [32 x 1 x 1], and your grid size is some integer N (which it has to be), so the launch has N * 32 threads in total, which is still a multiple of 32.
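To make the arithmetic concrete, here is a minimal sketch of how the launch size works out for a 1D launch. The helper names `blocksFor` and `threadsLaunched` are made up for illustration and are not part of the CUDA API; it also assumes `n + threadsPerBlock` fits in an `int`.

```cuda
// Hypothetical helper: number of blocks needed to cover n elements
// when each block has threadsPerBlock threads (ceiling division).
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: total threads the launch actually creates.
// This is blocks * threadsPerBlock, so it is always a multiple of
// threadsPerBlock, even when n is not.
inline int threadsLaunched(int n, int threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock;
}
```

For 1000 elements and 32 threads per block this gives 32 blocks and 1024 threads, so the last 24 threads have no element to work on. Those are the threads you need to deal with one way or another.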

I think what you’re really getting at is bounds checking, i.e. processing data whose size is not an exact multiple of the kernel launch dimensions. How much this affects your performance depends on the data and your algorithm.

Some potential performance hits that immediately come to mind are warp divergence (from data-dependent branching), wasted memory (if you employ padding to handle boundary conditions), and just generally extra instructions for bounds checking (I’m sure there are others).

Sometimes it makes sense to use padding as you alluded to in your post, but sometimes (say, for large matrices) this will take too much memory. Instead, you can use shared memory and load just a subset of the data (i.e. a tile whose size is a multiple of 32) into shared memory, and handle your bounds checking there: when you get to the last tile with fewer than 32 valid elements, fill in the remaining slots with an identity value (0 if you’re summing, 1 if you’re multiplying). Or you can just check that the thread maps to a valid index, and only perform your operation if it does. You can get warp divergence from this, but it only affects the one warp that straddles the end of the data, so as long as the conditional operations aren’t really expensive it shouldn’t be too much of a problem.
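Here is a rough sketch of both options, in case it helps. Everything in it is my own illustration rather than code from your post: the names `scaleGuarded`, `tileSums`, `sumOnDevice` and `TILE` are hypothetical, it assumes a 1D launch with exactly `TILE` threads per block and `n > 0`, and it skips CUDA error checking to stay short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed block size: one warp per block

// Option 1: plain bounds check. Threads past the end of the data do nothing.
__global__ void scaleGuarded(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Option 2: load a tile into shared memory and pad the invalid slots with
// the identity for the operation (0 for a sum). Must be launched with
// blockDim.x == TILE.
__global__ void tileSums(const float* in, float* blockSums, int n) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // the only bounds check
    __syncthreads();

    // Tree reduction over the full tile. No index checks against n are
    // needed here because the padded slots contribute 0 to the sum.
    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = tile[0];
    }
}

// Hypothetical host wrapper: sums n floats (n > 0) using tileSums, then
// adds up the per-block partial sums on the CPU.
float sumOnDevice(const float* hostIn, int n) {
    int blocks = (n + TILE - 1) / TILE;  // round up to cover the tail
    float* dIn = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    tileSums<<<blocks, TILE>>>(dIn, dSums, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dSums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The guarded kernel would be launched the same way, e.g. `scaleGuarded<<<(n + TILE - 1) / TILE, TILE>>>(dData, n, 2.0f);` with `dData` being a device pointer you allocated. Which option is better depends on the algorithm: the guard is simplest for element-wise work, while the identity padding pays off when the threads in a block cooperate (reductions, tiled matrix multiply) and you don’t want index checks inside the inner loop.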

Hopefully you really were asking about boundary conditions…