Warp scheduler and dimensionality

Hi,

I have some doubts about how to understand a multidimensional block of threads. For instance, say we have a block of 16x16 threads working on 2D data (an image, for instance). How should I understand it? Will the scheduler create 16 warps of 16 threads, leading to half occupancy, or will it create 8 warps of 32 threads? In the latter case, will the "coalesced" memory read constraint still be respected between thread 15 and thread 16 in the warp, or does the coalescing constraint no longer matter as long as each group of 16 threads reads an aligned 128 bytes? Or should all of this be interpreted differently depending on the compute capability of the card? I know that for older cards the granularity was a half-warp of 16 threads. That's kind of confusing...

Thank you in advance.

Warps are created out of groups of 32 threads.

Threads in a warp are grouped in x first, then y, then z.

warp 0 (X/Y):

0/0  1/0  2/0  3/0  4/0  5/0  6/0  7/0  8/0  9/0  10/0  11/0  12/0  13/0  14/0  15/0
0/1  1/1  2/1  3/1  4/1  5/1  6/1  7/1  8/1  9/1  10/1  11/1  12/1  13/1  14/1  15/1

warp 1:

0/2  1/2  ...

referring to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Threads with ID 0-31 compose the first warp, threads 32-63 compose the second warp, etc. Note that the definition of thread ID I am using here is the one given in the doc linked above, not any other.

Thank you for the clear explanation. There may be multiple 128-byte requests per warp depending on the data type. That's also what confused me. So I understand that it does not matter if thread 0/1 in warp 0 accesses a different memory line in my image, as long as I respect an aligned 128-byte (or multiple of 128 bytes) transfer along successive threads.

For instance, in warp 0, if threads 0/0 to 15/0 load 128 bytes (or n requests of 128 bytes) and threads 0/1 to 15/1 load 128 bytes (or n times 128 bytes) at a different memory location (the second row of my image here), I should get the best performance. Am I right?

Yes. As long as each 128-byte segment is fully utilized, the load efficiency is optimal.

Great, thank you. I'm glad I asked. But what happens in this case: let's say again that I use 32-bit data to represent an RGBX image and I use a 16x16 thread block. Here a half-warp will load only 64 bytes of contiguous data (one pixel per thread). In this case the memory load is not efficient: there will be two memory loads and each time the cache line will be half empty, right? I assume that in this case a thread block dimension of 8x32 would be more efficient.

That's correct. If you overlay a 16x16 threadblock on 32-bit data, then each warp load will (probably) require 2 or more segments from memory. For this type of locality, the official global load efficiency would be only 50%, but on GPUs with caches, the caches are likely to mitigate some of the impact. To work around this, you can adjust the threadblock size to 32 threads in X and 8 threads in Y, or, if your algorithm permits, you can load a 64-bit quantity per thread, e.g. a uint2 vector load, unpack the quantities, and process 2 pixels per thread. And there are probably many other options as well.

Great! Thanks again. That is really interesting. I didn't know about the caches and how they can influence performance even when a 128-byte segment is not fully used. I will read more on that if I can find some documents explaining it. Have a nice day.