Say I’ve got a 16x16 matrix of floats, a kernel that operates on it, and a single 16x16 block allocated for the kernel. I want to use tiling along with shared memory to minimize global memory accesses.
Let’s assume I want to use 8x8 tiles for this, along with a shared float array into which I’d load the elements of one tile, then the next tile, and so on. That would mean a total of four tiles inside my block.
Question: Doesn’t that mean I’ll end up using 8x8 = 64 threads, instead of 16x16 = 256 threads?
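For concreteness, here’s a rough sketch of the setup I have in mind (the kernel name and the doubling operation are just placeholders for whatever the kernel actually does): a single 8x8 block of threads, where each thread loads one element per tile into shared memory and the block walks over all four tiles in turn.

```cuda
#define N 16
#define TILE 8

// Hypothetical kernel: 8x8 = 64 threads, each handling one element per
// tile, covering the four 8x8 tiles of the 16x16 matrix one at a time.
__global__ void processTiled(const float *in, float *out)
{
    __shared__ float tile[TILE][TILE];

    // The four tiles form a 2x2 grid over the 16x16 matrix.
    for (int ty = 0; ty < N / TILE; ++ty) {
        for (int tx = 0; tx < N / TILE; ++tx) {
            int row = ty * TILE + threadIdx.y;
            int col = tx * TILE + threadIdx.x;

            // One global load per thread per tile, staged in shared memory.
            tile[threadIdx.y][threadIdx.x] = in[row * N + col];
            __syncthreads();

            // Placeholder for the real work on the staged tile.
            out[row * N + col] = tile[threadIdx.y][threadIdx.x] * 2.0f;
            __syncthreads();
        }
    }
}

// Launched with a single 8x8 block:
// processTiled<<<1, dim3(TILE, TILE)>>>(d_in, d_out);
```

In this version each of the 64 threads processes four elements (one per tile), which is the thread-count trade-off the question is about.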