It works just as well for classical BLAS-type matrix manipulations. My design approach is slightly different. I consider one output element per thread as an initial baseline, which usually makes for the simplest code.
Depending on what performance analysis indicates, one can then extend this to “a few” output elements produced per thread, in particular with the goal of optimizing memory transactions, as discussed by Robert Crovella in the SO answer.
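A minimal sketch of that progression, assuming a trivial element-wise scaling as the workload. The kernel names, the `ELEMS` constant, and the `run_scale` host helper are hypothetical and chosen only for illustration. The second kernel strides by the total thread count, so neighboring threads still access neighboring addresses. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Baseline: each thread produces exactly one output element.
__global__ void scale_one_per_thread(const float* in, float* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Extension: each thread produces up to ELEMS output elements.
// ELEMS is an illustrative tuning knob, not a recommended value.
constexpr int ELEMS = 4;

__global__ void scale_few_per_thread(const float* in, float* out, float s, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = tid + k * stride;          // adjacent threads touch adjacent floats
        if (i < n) out[i] = s * in[i];
    }
}

// Hypothetical host helper: runs one of the two kernels and returns the result.
// CUDA API return codes are not checked here to keep the sketch short.
std::vector<float> run_scale(const std::vector<float>& h_in, float s, bool few)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int work    = threads * (few ? ELEMS : 1);  // outputs per block
    const int blocks  = (n + work - 1) / work;        // enough blocks to cover n

    if (few) scale_few_per_thread<<<blocks, threads>>>(d_in, d_out, s, n);
    else     scale_one_per_thread<<<blocks, threads>>>(d_in, d_out, s, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Both variants should produce identical results; whether the second one is actually faster is something only profiling on the target GPU can tell.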
Combining multiple pixels for processing is often attractive when retrieving individual pixels results in narrow loads. Using a wide load that retrieves multiple pixels in one access is beneficial in that context, at a minimum by reducing the dynamic instruction count.
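As a rough illustration, assume 8-bit grayscale pixels and a simple inversion as the per-pixel operation (both are assumptions, as are the names `invert_narrow`, `invert_wide`, and `run_invert`). The narrow kernel issues one byte-wide load and store per pixel, while the wide kernel uses the built-in `uchar4` type to move four pixels with one 32-bit access. The cast to `uchar4*` relies on the buffers being 4-byte aligned, which holds for pointers returned by `cudaMalloc`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Narrow: one 8-bit pixel per thread, i.e. one byte-wide load and store each.
__global__ void invert_narrow(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Wide: four pixels per thread, moved with a single 32-bit load and store.
// Requires 'in' and 'out' to be 4-byte aligned.
__global__ void invert_wide(const unsigned char* in, unsigned char* out, int n)
{
    int i  = blockIdx.x * blockDim.x + threadIdx.x;
    int n4 = n / 4;                        // number of complete 4-pixel groups
    if (i < n4) {
        uchar4 p = reinterpret_cast<const uchar4*>(in)[i];
        p.x = 255 - p.x;
        p.y = 255 - p.y;
        p.z = 255 - p.z;
        p.w = 255 - p.w;
        reinterpret_cast<uchar4*>(out)[i] = p;
    }
    // Leftover pixels (n % 4 of them) are handled with narrow accesses
    // by the first few threads of the grid.
    if (i < n - 4 * n4) {
        int t = 4 * n4 + i;
        out[t] = 255 - in[t];
    }
}

// Hypothetical host helper; CUDA API return codes are not checked for brevity.
std::vector<unsigned char> run_invert(const std::vector<unsigned char>& h_in, bool wide)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<unsigned char> h_out(n);
    if (n == 0) return h_out;

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int items   = wide ? n / 4 : n;  // work items that need a thread
    int blocks = (items + threads - 1) / threads;
    if (blocks < 1) blocks = 1;            // keep one block for the leftover pixels

    if (wide) invert_wide<<<blocks, threads>>>(d_in, d_out, n);
    else      invert_narrow<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For the same image, the wide variant launches about a quarter as many threads and should execute correspondingly fewer load and store instructions. The actual speedup, if any, depends on the GPU and on how memory-bound the kernel is.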