I’ve read a lot about coalescing (supercomputing slides, threads on this forum…) but I’m still struggling with some questions.
I have a kernel which reads in 3 uchars (YUV420) and writes 3 numbers (RGB-values) to a PBO (uint).
If I play with the block dimensions I get different results. None of them seem to give coalesced reads from global memory.
But if I take 16x16, all my writes are coalesced, and I get a 10% speedup in kernel execution time.
I’ve tried 8x8, 12x12, … what is so special about 16x16 that makes all my writes coalesced?
If you need more info, just let me know. The kernel is only a few lines, so I can paste it once I have access to it again.


If you are accessing 2D tiles in gmem with the same dimensions as the threadblock, then you need a multiple of 16 threads in the x-dimension for coalescing. Think of the threadblock as a 2D array in row-major order, and derive linear thread IDs from that. A half-warp is 16 consecutive threads (as indexed by these “linear” IDs), and coalescing requirements must be satisfied per half-warp.

Also, make sure you do the reads in a coalesced way. There’s been another thread on coalescing char3 reads. You can always use staging through smem as demonstrated by the float3 example in the Supercomputing 07 slides.


Everywhere I read that the dimensions of a thread block should be at minimum 8x8, and it’s advised to have 128–192 threads/block.
Does this influence the performance described in the first post, and what is the rationale behind that minimum threads/block? I always find the advice itself, but I can’t find the rationale behind it.

Kind regards (and big thanks for the posts to all my questions ever so far :D)

Having a multiple of 64 threads allows the scheduler to overcome register-bank conflicts. Having 192 threads allows the scheduler to hide register read-after-write dependencies; see the CUDA 1.1 programming guide. However, I think you can implement efficient kernels without having 192 threads. It is best to have the thread count be a multiple of at least 32 (the warp size), though.

The dimensions of a threadblock matter only if they affect the global memory addressing pattern of the threadblock. For example, if your 8x8 threadblock reads an 8x8 2D tile, you won’t coalesce and performance will drop, while a 16x4 threadblock/tile will coalesce.

Memory access latency hiding improves when you have many threads per multiprocessor. A multiprocessor supports up to 8 threadblocks at once, so if your threadblocks are really small, that will limit latency hiding. As a rule of thumb, I suggest striving for at least 50% occupancy for memory-bound codes. The improvement from 67% to 100% occupancy is not huge, so I’m usually satisfied if I can reach 67%, but the difference between 33% and 67% is significant.


Thanks so far,

For now I’ve concentrated on getting the writes coalesced; I’ll look into the reads later, because I think they could be a bit more difficult.

int l = (ix >> 1);
int k = (iy >> 1);
int m = (k * width/2 + l);

unsigned char y = m_videoY[width * iy + ix];
unsigned char u = m_videoU[m];
unsigned char v = m_videoV[m];

So these are uchars: in one array every thread reads a different value, while in the other arrays one value is shared between 4 threads (a 2x2 group). Anyway, before I look into that, I want to get a decent understanding of coalescing.

From the Visual Profiler I know that with 16x16 as the block dimension all my writes become coalesced.

dst[width * iy + ix] = make_color(r,g,bl,0);

I write into a 1D buffer, and make_color is:

typedef unsigned int TColor;

__device__ TColor make_color(float r, float g, float b, float a)
{
    return ((int)(a) << 24) | ((int)(b) << 16) | ((int)(g) << 8) | ((int)(r));
}


Why does a block dimension of 8*x not give coalesced writes?

Because the x-dimension of the block must be a multiple of 16: with blockDim.x = 8, a half-warp of 16 threads spans two rows of the tile, so its addresses are not 16 consecutive words in gmem.