CUDA & tiling clarification

Say I’ve got a 16x16 matrix of floats and a kernel that does some work on it, and that I’ve launched the kernel with a single 16x16 block. I want to use tiling along with shared memory to minimize global memory accesses.

Let’s assume I want to use 8x8 tiles for this, along with a __shared__ float[8][8] array into which I’d load the elements of one tile, then the next tile, and so on. That would mean a total of four tiles inside my block.
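In code, I’m picturing something like this (the kernel name is just a placeholder, and the body is elided):

```cuda
// One 8x8 shared tile, meant to hold one tile of the 16x16 matrix at a time.
__global__ void myKernel(float *mat)   // placeholder name
{
    __shared__ float tile[8][8];

    // ... load one tile into `tile`, compute on it, write it back,
    //     then move on to the next tile ...
}

// Presumably launched with a single 8x8 block, i.e. 64 threads:
// myKernel<<<1, dim3(8, 8)>>>(d_mat);
```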

Question: Doesn’t that mean I’ll end up using 8x8 = 64 threads, instead of 16x16 = 256 threads?

If you want all 4 tiles accessible to a single block at the same time, then a __shared__ float[8][8] array is not going to be sufficient; you’d need shared storage for all four tiles (e.g. 16x16).

However, if you’re only dealing with one 8x8 tile at a time, then I’d say yes: typically you would have an 8x8 shared float array and an 8x8 thread block, i.e. 64 threads. That same thread block can still process all 4 tiles, reusing its threads once each tile is finished, as sketched below.
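A minimal sketch of that reuse pattern, assuming a simple element-wise operation (the kernel name and the scale-by-2 step are placeholders for whatever your kernel actually does):

```cuda
#define N    16   // matrix dimension
#define TILE 8    // tile dimension

// One 8x8 thread block walks the four 8x8 tiles of the 16x16 matrix,
// reusing the same 64 threads and the same shared array for each tile.
__global__ void tiledProcess(float *mat)
{
    __shared__ float tile[TILE][TILE];

    for (int tr = 0; tr < N / TILE; ++tr) {       // tile row: 0, 1
        for (int tc = 0; tc < N / TILE; ++tc) {   // tile column: 0, 1
            int row = tr * TILE + threadIdx.y;
            int col = tc * TILE + threadIdx.x;

            tile[threadIdx.y][threadIdx.x] = mat[row * N + col]; // one global load
            __syncthreads();  // make sure the tile is fully loaded

            tile[threadIdx.y][threadIdx.x] *= 2.0f;  // placeholder computation
            __syncthreads();  // needed if threads read each other's entries

            mat[row * N + col] = tile[threadIdx.y][threadIdx.x]; // one global store
            __syncthreads();  // don't reload the tile while stores are pending
        }
    }
}

// Launched with 64 threads total:
// tiledProcess<<<1, dim3(TILE, TILE)>>>(d_mat);
```

Note the __syncthreads() calls: for a purely element-wise operation like this placeholder they aren’t strictly necessary, but they’re part of the standard tiling pattern as soon as threads read tile entries loaded by other threads.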

But this is a computer program, right? There are usually many ways to accomplish almost anything.
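For example, you could keep the full 16x16 block you originally allocated: 256 threads, one per element, a shared array covering all four tiles at once, and no tile loop at all. A sketch under the same placeholder assumptions:

```cuda
#define N 16   // matrix dimension

// Alternative: a full 16x16 block (256 threads). The shared array now
// holds the whole matrix, i.e. all four 8x8 tiles at once.
__global__ void blockProcess(float *mat)
{
    __shared__ float smem[N][N];

    int row = threadIdx.y;
    int col = threadIdx.x;

    smem[row][col] = mat[row * N + col];   // one global load per thread
    __syncthreads();

    smem[row][col] *= 2.0f;                // placeholder computation
    __syncthreads();

    mat[row * N + col] = smem[row][col];   // one global store per thread
}

// blockProcess<<<1, dim3(N, N)>>>(d_mat);
```

This also ties back to the first point above: if you want all four tiles resident in shared memory at once, the shared storage has to grow to match.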