[SOLVED] Knowledge-sharing from a thread I accidentally found in SO (latency hiding details)

If you are a CUDA senior, you certainly know this already, but as a junior I am sorry to say I only found it now, and by accident:
cuda - Is starting 1 thread per element always optimal for data independent problems on the GPU? - Stack Overflow

It clears up a lot of questions I had in the beginning and still had until today.

For image processing algorithms, we have had good experience with letting each thread process a few pixels (typically 4), so it is our default strategy for all our image processing CUDA kernels.
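For anyone who wants to see the idea spelled out, here is a minimal sketch of the "4 pixels per thread" layout. This is not the code from the GTC talk; the kernel name, the brightness operation, and the launch shape are just illustrative assumptions:

```cuda
// Illustrative only: each thread handles 4 pixels of a flat 8-bit image,
// spaced one block-width apart so that each loop iteration's warp accesses
// touch contiguous bytes.
__global__ void brighten(const unsigned char* __restrict__ src,
                         unsigned char* __restrict__ dst,
                         int numPixels, int offset)
{
    const int PIXELS_PER_THREAD = 4;
    int base = blockIdx.x * blockDim.x * PIXELS_PER_THREAD + threadIdx.x;
    for (int k = 0; k < PIXELS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;                  // stride by block width
        if (i < numPixels)
            dst[i] = (unsigned char)min((int)src[i] + offset, 255);
    }
}

// Example launch: each block of 256 threads covers 4*256 pixels.
// int block = 256;
// int grid  = (numPixels + 4 * block - 1) / (4 * block);
// brighten<<<grid, block>>>(d_src, d_dst, numPixels, 16);
```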

More information at http://on-demand.gputechconf.com/gtc/2018/presentation/s8111-high-performance-image-processing-routines-for-video-and-film-processing.pdf

It works just as well for classical BLAS-type matrix manipulations. My design approach is slightly different. I consider one output element per thread as an initial baseline, which usually makes for the simplest code.
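As a sketch of that baseline (kernel name and the element-wise operation are my own assumptions, not from the thread), a row-major matrix addition with exactly one output element per thread could look like this:

```cuda
// Illustrative baseline: one output element per thread for C = A + B on
// m x n row-major matrices, launched with a 2D grid of 2D blocks.
__global__ void matAdd(int m, int n,
                       const float* __restrict__ A,
                       const float* __restrict__ B,
                       float* __restrict__ C)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n) {
        int idx = row * n + col;
        C[idx] = A[idx] + B[idx];    // exactly one output element per thread
    }
}
```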

Depending on what performance analysis indicates, one can then extend this to “a few” output elements produced per thread, in particular with the goal of optimizing memory transactions, as discussed by Robert Crovella in the SO answer.
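One common way to realize that extension is to let each thread produce four adjacent outputs via 128-bit vector accesses. The following is only a sketch, under the assumptions that n is a multiple of 4 and the buffers are 16-byte aligned; the factor of 4 is a tuning knob to revisit with the profiler, not a recommendation:

```cuda
// Illustrative extension of the baseline above: each thread produces 4
// adjacent outputs per row using one float4 load/store per operand, so a warp
// still covers a contiguous range of memory per instruction.
__global__ void matAdd4(int m, int n,
                        const float4* __restrict__ A,
                        const float4* __restrict__ B,
                        float4* __restrict__ C)
{
    int quadsPerRow = n / 4;                            // n assumed divisible by 4
    int col4 = blockIdx.x * blockDim.x + threadIdx.x;   // index of a 4-wide chunk
    int row  = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col4 < quadsPerRow) {
        int idx = row * quadsPerRow + col4;
        float4 a = A[idx];
        float4 b = B[idx];
        C[idx] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }
}
```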

Combining multiple pixels for processing is often attractive when retrieving individual pixels results in narrow loads. Using a wide load that retrieves multiple pixels in one access is beneficial in that context, at minimum by reducing dynamic instruction count.
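Applied to the earlier pixel sketch, the same idea (again just an illustration, assuming 4-byte aligned buffers and a pixel count divisible by 4) replaces four 8-bit loads and four 8-bit stores per thread with one 32-bit load and one 32-bit store:

```cuda
// Illustrative wide-load variant: one uchar4 access moves 4 pixels at once,
// cutting the per-thread load/store instruction count from 8 to 2.
__global__ void brightenWide(const uchar4* __restrict__ src,
                             uchar4* __restrict__ dst,
                             int numQuads, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numQuads) {
        uchar4 p = src[i];                              // 4 pixels, one 32-bit load
        p.x = (unsigned char)min((int)p.x + offset, 255);
        p.y = (unsigned char)min((int)p.y + offset, 255);
        p.z = (unsigned char)min((int)p.z + offset, 255);
        p.w = (unsigned char)min((int)p.w + offset, 255);
        dst[i] = p;                                     // 4 pixels, one 32-bit store
    }
}
```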

Nice stuff there, Hannes.
(Off-topic: I’m planning a motorcycle tour to Mozarthaus this summer)

thanks, enjoy :-)