Consider a scenario in which an algorithm parallelizes well across many threads, and some of these threads share data. To make this efficient, the shared data should be fetched from global memory only once and then placed in shared memory. This works well when the number of threads sharing data is constant, as in the classic tiled matrix multiplication example.
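To make the fixed-size case concrete, here is a minimal sketch of the tiling pattern I mean (kernel and buffer names are my own); every block stages the same, compile-time-constant amount of data:

```cuda
#define TILE 16

// Each block stages one TILE x TILE sub-matrix into shared memory once;
// after that, reads by threads in the block hit shared memory instead of
// going back to global memory.
__global__ void tiledStage(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE];   // fixed size, identical for every block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    if (row < n && col < n)
        tile[threadIdx.y][threadIdx.x] = in[row * n + col];  // one global load
    __syncthreads();

    if (row < n && col < n)
        out[row * n + col] = tile[threadIdx.y][threadIdx.x]; // served from shared memory
}
```

This pattern relies on `TILE` being a compile-time constant that is the same for every block, which is exactly the assumption that breaks down in the situation described below.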
However, what if the number of threads that share data varies, and the size of the data they share probably varies as well? Since the block size is the same for every block in a kernel launch, how can such an algorithm be implemented efficiently?