Warp scheduler and dimensionality

Hi,

I have some doubts about how to interpret a multidimensional block of threads. For instance, say we have a block of 16x16 threads working on 2D data (an image, for example). How should I understand it? Will the scheduler create 16 warps of 16 threads, leading to half occupancy, or will it create 8 warps of 32 threads? In the latter case, will the coalesced memory read constraint still be respected between threads 15 and 16 of the warp, or does that constraint no longer matter as long as each group of 16 threads reads 128 bytes? Or should all of this be interpreted differently depending on the compute capability of the card? I know that on older cards the granularity was a half-warp of 16 threads. It’s all rather confusing…

Warps are created out of groups of 32 threads.

Threads in a warp are grouped in x first, then y, then z.

warp 0 (X/Y):

0/0 1/0 2/0 3/0 4/0 5/0 6/0 7/0 8/0 9/0 10/0 11/0 12/0 13/0 14/0 15/0
0/1 1/1 2/1 3/1 4/1 5/1 6/1 7/1 8/1 9/1 10/1 11/1 12/1 13/1 14/1 15/1

warp 1 (X/Y):

0/2 1/2 ...

Referring to the programming guide:

Threads with ID 0-31 compose the first warp, threads with ID 32-63 compose the second warp, etc. Note that the definition of thread ID I am using here is the one given in the doc link above, not any other.
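To make the grouping rule concrete, here is a small host-side sketch in Python (not CUDA code) that models it. The linearization x + y*Dx + z*Dx*Dy is the thread ID definition from the programming guide; the function names `linear_thread_id` and `warp_index` are made up for this illustration.

```python
WARP_SIZE = 32  # threads per warp

def linear_thread_id(x, y, z, dim_x, dim_y):
    """Thread ID within a block: x varies fastest, then y, then z."""
    return x + y * dim_x + z * dim_x * dim_y

def warp_index(x, y, z, dim_x, dim_y):
    """Index of the warp (within the block) that thread (x, y, z) falls into."""
    return linear_thread_id(x, y, z, dim_x, dim_y) // WARP_SIZE

# 16x16 block: collect the (x, y) coordinates of the threads in warp 0.
warp0 = [(x, y) for y in range(16) for x in range(16)
         if warp_index(x, y, 0, 16, 16) == 0]

print(len(warp0))                       # 32
print(warp0[0], warp0[15])              # (0, 0) (15, 0)
print(warp0[16], warp0[31])             # (0, 1) (15, 1)
print(warp_index(0, 2, 0, 16, 16))      # 1
```

So a 16x16 block yields 8 warps of 32 threads, and each warp covers two consecutive rows of the block.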

Thank you for the clear explanation. There may be multiple 128-byte requests per warp depending on the data type; that is also what confused me. So I understand that it does not matter if thread 0/1 in warp 0 accesses a different memory line of my image, as long as I respect aligned 128-byte (or multiple of 128 bytes) transfers along successive threads.

For instance, in warp 0, if threads 0/0 to 15/0 load 128 bytes (or n requests of 128 bytes) and threads 0/1 to 15/1 load 128 bytes (or n times 128 bytes) at a different memory location (the second row of my image here), I should get the best performance. Am I right?

Yes. As long as each 128 byte segment is fully utilized, the load efficiency is optimal.
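As a rough check of what "fully utilized" means here, the following Python sketch counts the 128-byte segments one warp touches and compares them with the bytes the warp actually asks for. It is only a simplified model: the 128-byte granularity is the one assumed in this thread (the real transaction size depends on the architecture), and the function name, the 4096-byte row pitch and the addresses are illustrative assumptions.

```python
SEGMENT_BYTES = 128  # transaction granularity assumed in this thread

def warp_load_efficiency(warp_threads, elem_bytes, pitch_bytes, base=0):
    """Bytes requested by the warp divided by bytes in the segments it touches.

    warp_threads: (x, y) coordinates of the threads in one warp; each thread
    reads one element at byte address base + y * pitch_bytes + x * elem_bytes.
    """
    segments = set()
    for x, y in warp_threads:
        first = base + y * pitch_bytes + x * elem_bytes
        last = first + elem_bytes - 1
        segments.update(range(first // SEGMENT_BYTES, last // SEGMENT_BYTES + 1))
    requested = len(warp_threads) * elem_bytes
    return requested / (len(segments) * SEGMENT_BYTES)

# Warp 0 of a 16x16 block: rows y = 0 and y = 1, 16 threads each.
warp0 = [(x, y) for y in range(2) for x in range(16)]

# 8-byte elements, rows 4096 bytes apart: two segments, both fully used.
print(warp_load_efficiency(warp0, elem_bytes=8, pitch_bytes=4096))           # 1.0
# Same access shifted by 64 bytes: each row now straddles two segments.
print(warp_load_efficiency(warp0, elem_bytes=8, pitch_bytes=4096, base=64))  # 0.5
```

The two rows landing in different places costs nothing in this model; only alignment and how much of each segment is used matter.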

Great, thank you. I’m glad I asked. But what happens in this case? Say again that I use 32-bit data to represent an RGBX image with a 16x16 thread block. Here, a half-warp will load only 64 bytes of contiguous data (one pixel per thread). In this case the memory load is not efficient: there will be two memory loads, and each time the cache line will be half empty, right? I assume that in this case a thread block dimension of 8x32 (8 rows of 32 threads) would be more efficient.
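The arithmetic behind this comparison can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a measurement: it assumes one element per thread, a block width that divides 32, image rows that start on 128-byte boundaries, and the 128-byte granularity discussed above. The helper `row_efficiency` is a name invented for this example.

```python
import math

WARP_SIZE = 32
SEGMENT_BYTES = 128  # granularity assumed in the discussion above

def row_efficiency(block_width, elem_bytes):
    """Used fraction of the segments fetched for one row of a warp.

    Assumes block_width divides 32 and every row starts on a 128-byte
    boundary, so a row of the warp occupies ceil(row_bytes / 128) segments.
    """
    row_bytes = min(block_width, WARP_SIZE) * elem_bytes
    segments = math.ceil(row_bytes / SEGMENT_BYTES)
    return row_bytes / (segments * SEGMENT_BYTES)

# Compare block shapes (width x height) for 4-byte RGBX pixels.
for width, height in [(16, 16), (32, 8)]:
    rows_per_warp = max(1, WARP_SIZE // width)
    print(f"{width}x{height} block, 4-byte pixels: "
          f"{rows_per_warp} row(s) per warp, "
          f"efficiency {row_efficiency(width, 4):.2f}")
# 16x16 block, 4-byte pixels: 2 row(s) per warp, efficiency 0.50
# 32x8 block, 4-byte pixels: 1 row(s) per warp, efficiency 1.00
```

Under these assumptions a block that is 32 threads wide in x lets each warp read one contiguous 128-byte span, whereas a 16-wide block reads two 64-byte spans from different rows.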