Warp scheduler and dimensionality

Hi,

I have some doubts about how to understand a multidimensional block of threads. For instance, say we have a block of 16x16 threads working on 2D data (an image, for instance). How should I understand it? Will the scheduler create 16 warps of 16 threads, leading to half occupancy, or will it create 8 warps of 32 threads? In the latter case, will the "coalesced" memory read constraint still be respected between thread 15 and thread 16 in the warp, or does the coalescing constraint no longer matter as long as each group of 16 threads reads an aligned 128 bytes? Or should all of this be interpreted differently depending on the compute capability of the card? I know that for older cards the granularity was a half-warp of 16 threads. That's kind of confusing...

Thank you in advance.

Warps are created out of groups of 32 threads.

Threads in a warp are grouped in x first, then y, then z.

warp 0 (X/Y):

0/0  1/0  2/0  3/0  4/0  5/0  6/0  7/0  8/0  9/0  10/0  11/0  12/0  13/0  14/0  15/0
0/1  1/1  2/1  3/1  4/1  5/1  6/1  7/1  8/1  9/1  10/1  11/1  12/1  13/1  14/1  15/1

warp 1:

0/2  1/2  ...

referring to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Threads with ID 0-31 compose the first warp, threads 32-63 compose the second warp, etc. Note that the definition of thread ID I am using here is the one given in the doc linked above, not any other.

Thank you for the clear explanation. There may be multiple 128-byte requests per warp depending on the data type. That's also what confused me. So I understand that it does not matter if thread 0/1 in warp 0 accesses a different memory line in my image, as long as I respect an aligned 128-byte (or multiple of 128 bytes) transfer along successive threads.

For instance, in warp 0, if threads 0/0 to 15/0 load 128 bytes (or n requests of 128 bytes) and threads 0/1 to 15/1 load 128 bytes (or n times 128 bytes) at a different memory location (the second row of my image here), I should get the best performance. Am I right?

Yes. As long as each 128-byte segment is fully utilized, the load efficiency is optimal.

Great, thank you. I'm glad I asked. But what happens in this case: let's say again that I use 32-bit data to represent an RGBX image and I use a 16x16 thread block. Here a half-warp will load only 64 bytes of contiguous data (one pixel per thread). In this case the memory load is not efficient: there will be two memory loads and each time the cache line will be half empty, right? I assume that in this case a thread block dimension of 8x32 would be more efficient.

That's correct. If you overlay a 16x16 threadblock on 32-bit data, then each warp load will (probably) require 2 or more segments from memory. For this type of locality, the official global load efficiency would be only 50%, but on GPUs with caches, the caches are likely to mitigate some of the impact. To work around this, you can adjust the threadblock size to 32 threads in X and 8 threads in Y, or, if your algorithm permits, you can load a 64-bit quantity per thread, e.g. a uint2 vector load, unpack the quantities, and process 2 pixels per thread. And there are probably many other options as well.

Great! Thanks again. That is really interesting. I didn't know about the caches and how they can influence performance even when a 128-byte segment is not fully used. I will read more on that if I can find some documents explaining it. Have a nice day.