I’ve read a lot about coalescing (supercomputing slides, this forum…) but I am still struggling with some questions.
I have a kernel which reads in 3 uchars (YUV420) and writes 3 numbers (RGB-values) to a PBO (uint).
If I play with the block dimensions I get different results. None of them seem to give coalesced reads from global memory.
But if I use 16x16, all my writes are coalesced, and I get a 10% speedup in kernel execution time.
I’ve tried 8x8, 12x12, … What is so special about 16x16 that makes all my writes coalesced?
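To make the question concrete, here is a small host-side C++ model of how I understand the write pattern. It is only a sketch and not my actual kernel. It assumes one thread per pixel, one 4-byte uint written per thread at index `y * imageWidth + x`, an output buffer that starts on a 64-byte boundary, and a coalescing rule checked per half-warp of 16 threads (lane k must write word k of one 64-byte segment). The function names and the 640x480 image size in the usage note are made up for illustration.

```cpp
#include <cassert>

// Host-side model only; this is not device code and not the real kernel.
// Assumptions (all hypothetical): one thread per pixel, each thread writes one
// 4-byte uint at index y * imageWidth + x, the output buffer starts on a
// 64-byte boundary, and coalescing is judged per half-warp of 16 threads:
// lane k must write word k of a single 16-word (64-byte) segment.
constexpr int kHalfWarp = 16;

// Checks one half-warp of one block against the rule above.
bool halfWarpCoalesced(int blockW, int blockH, int imageWidth, int imageHeight,
                       int blockX, int blockY, int halfWarp) {
    long segment = -1;  // segment used by the first active lane
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        int t = halfWarp * kHalfWarp + lane;   // linear thread id in the block
        if (t >= blockW * blockH) continue;    // lane does not exist
        int x = blockX * blockW + t % blockW;  // threadIdx.x varies fastest
        int y = blockY * blockH + t / blockW;
        if (x >= imageWidth || y >= imageHeight) continue;  // inactive thread
        long word = static_cast<long>(y) * imageWidth + x;
        if (word % kHalfWarp != lane) return false;     // wrong slot in segment
        if (segment == -1) segment = word / kHalfWarp;
        if (word / kHalfWarp != segment) return false;  // spans two segments
    }
    return true;
}

// Checks every half-warp of every block in a grid that covers the image.
bool allWritesCoalesced(int blockW, int blockH, int imageWidth, int imageHeight) {
    assert(blockW > 0 && blockH > 0 && imageWidth > 0 && imageHeight > 0);
    int gridW = (imageWidth + blockW - 1) / blockW;
    int gridH = (imageHeight + blockH - 1) / blockH;
    int halfWarps = (blockW * blockH + kHalfWarp - 1) / kHalfWarp;
    for (int by = 0; by < gridH; ++by)
        for (int bx = 0; bx < gridW; ++bx)
            for (int h = 0; h < halfWarps; ++h)
                if (!halfWarpCoalesced(blockW, blockH, imageWidth, imageHeight,
                                       bx, by, h))
                    return false;
    return true;
}
```

Under these assumptions, `allWritesCoalesced(16, 16, 640, 480)` returns true, while the 8x8 and 12x12 variants return false, because with a block narrower than 16 a half-warp wraps onto the next image row. If that model is right, it would match what I measure, but I would like someone to confirm it.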
If you need more info, just let me know. The kernel is only a few lines, so I can paste it once I have access to it again.