Warp divergence branching inside loop

indigo80 · August 13, 2008, 6:52am

hi,

please correct me if Im wrong.

I have a kernel like that:

_global__ void someKernel(float* g_in, float* g_out){

if(tid<32){

 Â  Â //here first whole warp do some work, rest is waiting on barrier below

 Â  Â //each thread from warp reads some data from global to shared here

 Â  Â //(data is aligned to 32 and every thread from warp reads consecutive entry)

 Â  Â //- it should be coalesced read right?

 }

syncthreads(); //here all threads meet and continue

for(int g=0;g<64;++g){ //here we enter a big loop in each thread

Â  if(tid == 0){

 Â  Â  //here I need just one thread to do some preparations on the beginning

 Â  Â  //of every iteration. Rest is waiting on barrier below.

 Â  }

 Â  

 Â  syncthreads(); //all-thread meeting here - lets continue 

 Â  

 Â  //here all threads(warps) do some work in every 'for' iteration - no branching

 Â  //here so all warps are executing in parallel - right?

 Â  //but Im worried about that big loop and solo execution of thread 0 

 Â  //at the beginning of every iteration :/ - does it serialize execution?

Â  //DO WORK

Â  syncthreads(); //do i need this barrier here?

} //end of big loop

//we're out loop

 //every thread writes some data from shared back to global 

 //and kernel finishes

}

Do i get kernel flow right? - will it work like i wrote in comments above?

thank you!

edit: I mean just the kernel structure - serialization of execution, parallelism. Im not sure if i get it right.

MisterAnderson42 · August 13, 2008, 1:29pm

Yep, you’ve got everything correct.

The threads of the first warp will diverge shortly at the beginning of the loop, but then converge so everything is parallel in the guts of the loop.

Here is another way to think of __syncthreads(); It is a barrier to use whenever a thread touches shared memory that will be later read by a different thread. In loops, you need to have __syncthreads() both before and after the memory is touched so that one thread doesn’t get too far ahead and write a value that other threads are still reading, so your __synthreads() at the end of the loop is needed. I usually write it like this so the memory that the syncthreads() is protecting is made more obvious than stuffing a synchtreads at the end of the loop:

__synthreads()

if (threadIdx meats condition)

     update shared memory

__synthreads()

indigo80 · August 14, 2008, 7:41am

oh ok then - thx!

I’ve modified my kernel a little - found that in at-the-beginning-of-the-loop initialization phase I can divide my operations into two parallel parts. Please take a look once more, then if its ok, I ll be assured I finally understand this right.

Please take a look:

__global__ void kernel(...){

if(tid < 32){

 //some work done in parallel by threads from first warp - rest is waiting.

 //on first barrier inside loop

}

// no syncthreads() here! I moved it down into loop to avoid it at the

// end of loop (according to what you wrote in last post) - hope its ok, or do I need 

// it here also?

for(int i = 0; i<64; ++i) { //big loop begins

 syncthreads();

  if(tid == 0){

    //some work done by thread 0 (warp 0)

  }

  if(tid == 1) //OR SHOULD IT BE tid==32 in order to paralellize this initialization??

    //some work done by thread 1 (warp 0)  //thread 32 (warp 1)

  }

  syncthreads();

 //some work in parallel - all threads/warps

  //no syncthreads() here! - instead at the beginning of the loop

}

syncthreads(); //I dont know if i need it here, since I have one at loop begin

//all threads write to global

} //kernel finish

Is this code structure optimal for this algorithm kind ? :

Initialize at kernel start - parallelizable for 32 threads

//LOOP START

Initialize at loop start - parallelizable for 2 threads
Parallel work

//LOOP END

Kernel “de-briefing” - parallelizable for all threads.

And two quick ones:

First:

What is the right way to pass array (in shared memory) to device funtion:

__device__ void devFun(float2 * arr){

  //...

}

__global__ void kernel(....){

  __shared__ float2 s_array[76];

  //...

  devFun(&s_array);

}

devFun performs some work on passed array. But compiler protest when I try to do above.

Second:

I have float array of 3136 in shared memory - s_out. I need to write it back to global memory - g_out. I have 448 threads in block (14 warps).

So every thread has to write 7 values into global (448*7=3136).

3136 divides by 16, by 32, by 7, by 14, by 448 - so I can do the coalesced write, right?

Does this performs the trick? :

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

g_out[addr] = s_out[addr];  //0 ... 447

g_out[addr+448] = s_out[addr+448];  //448 ...895

g_out[addr+896] = s_out[addr+896];  //896...1343

g_out[addr+1344] = s_out[addr+1344];  //1344...1791

g_out[addr+1792] = s_out[addr+1792];  //1792...2239

g_out[addr+2240] = s_out[addr+2240];  //2240...2687

g_out[addr+2688] = s_out[addr+2688];  //2688...3135

or the same in loop:

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset=0;offset<3136;offset+=448){

  g_out[addr+offset] = s_out[addr+offset];

}

Will above 7 writes be coalesced? - this one is really hard for me External Media

Maybe Im asking for too much, but if you could provide answer to above questions it will be of great help to me - I need someone to assure me that i got everything right, or prove im wrong. So thank you very very much!

VrahoK · August 14, 2008, 8:03am

you should use

devFun(s_array);

or

devFun(&s_array[0]);

in your case you are passing the address of the pointer, not the pointer itself

Vrah

indigo80 · August 14, 2008, 8:15am

what a dumb mistake External Media thanks!

so two things left - code structure and coalescing , anyone? B)

VrahoK · August 14, 2008, 8:33am

Though I’m not 100% sure, I think this should work if you take the threadIdx.x as starting offset and your g_out is aligned. Then the writes in each warp are following each other so they should be coalesced.

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){

 Â g_out[addr+offset] = s_out[addr+offset];

}

Edit:

You can also write your code more flexible to the block and grid dimensions:

//at the end of kernel

int addr = blockIdx.x * 7 * blockDim.x; //block starting address

for(unsigned int offset = threadIdx.x; offset < 7 * blockDim.x; offset += blockDim.x){

  g_out[addr+offset] = s_out[addr+offset];

}

Only thing I’m not sure about is if the starting address meets the coalescing requirements. (blockIdx.x * 7 * 14 * 32)

indigo80 · August 14, 2008, 10:07am

there is an error in my code, it of course should be:

//at the end of kernel

int addr = blockIdx.x*3136+tid; //block starting address

for(unsigned int offset=0;offset<3136;offset+=448){

 g_out[addr+offset] = s_out[addr+offset];

}

Ive added tid to addr.

so its the same to what you, thanks for corretion, wrote:

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){

 g_out[addr+offset] = s_out[addr+offset];

}

and final question remain unanswered - is this coalesced?

you state that writing to global like above should be ok, and only thing uncertain is block starting address (bid71432 = bid3136) - so… when starting address is ok? :huh:

According to Programming Guide:

so … type is float so 16sizeof(float) is 64. Starting address for each block is divisible by 64 : bid3136/64 = bid * 49. Starting offset for all 7 writes is also divisible by 64:

448/64 = 7

896/64 = 14

1344/64 = 21

1792/64 = 28

2240/64 = 35

2688/64 = 42

so long everything is divisible by 64, but…

ok lets than take first write (offset 0) in block 1

block starting address is then 1*3136 = 3136

lets take second warp in this block, its first halfwarp - the first thread of this halfwarp is thread 32 → so the halfwarp starting address is 3136+32=3168

so: HalfWarpBaseAddress-BaseAddress = 3168-3136=(of course) 32.

32/16*sizeof(float)=32/64=0.5 - so it isnt divisible… but sticking to my computations coalesced access never occurs so i think i dont get it :(

indigo80 · August 15, 2008, 10:37pm

anyone? :(

ColinS · August 18, 2008, 10:50pm

I think it will be coalesced, since you’re writing float values, which is 4 bytes. (3136 + 4 16halfWarpNum) is always going to be divisible by 64, so you’re all set :)

MisterAnderson42 · August 19, 2008, 12:19pm

You can always run your code with the visual profiler and actually measure if it is coalesced or not.

indigo80 · August 20, 2008, 4:43pm

man you made my day - simplest solutions are often best one - my fault i didnt know about profiler before :">

checked and yes, things are coalesced!

tmurray · August 20, 2008, 4:56pm

Keep in mind that the profiler will claim things are always coalesced on GT200, even when they’re clearly not…

Topic		Replies	Views
There must be a reason, but I can not figure out:( CUDA Programming and Performance	8	8090	January 3, 2008
coalescing problem CUDA Programming and Performance	4	1064	August 8, 2011
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3313	January 10, 2010
inter-warp synchronization troubles with persistent threads (__threadfence_block() ?) CUDA Programming and Performance	6	2772	December 8, 2010
Using Shared Memory in CUDA C/C++ Technical Blog	36	1991	October 8, 2020
global to shared mem loads and sync CUDA Programming and Performance	26	11441	February 21, 2008
Best way to pack bits into words for global memory Better than reduce in shared memory? CUDA Programming and Performance	17	6677	June 2, 2012
Problem with correct branching within a warp CUDA Programming and Performance	23	15641	May 28, 2009
CUDA Use Cases run serial algorithms on composite data CUDA Programming and Performance	14	4492	October 24, 2008
CUDA Warp Synchronization Problem CUDA Programming and Performance	5	2158	February 27, 2011

Warp divergence branching inside loop

Related topics