Warp divergence branching inside loop

hi,

please correct me if Im wrong.

I have a kernel like that:

_global__ void someKernel(float* g_in, float* g_out){

if(tid<32){

 Â  Â //here first whole warp do some work, rest is waiting on barrier below

 Â  Â //each thread from warp reads some data from global to shared here

 Â  Â //(data is aligned to 32 and every thread from warp reads consecutive entry)

 Â  Â //- it should be coalesced read right?

 }

syncthreads(); //here all threads meet and continue

for(int g=0;g<64;++g){ //here we enter a big loop in each thread

  if(tid == 0){

 Â  Â  //here I need just one thread to do some preparations on the beginning

 Â  Â  //of every iteration. Rest is waiting on barrier below.

 Â  }

 Â  

 Â  syncthreads(); //all-thread meeting here - lets continue 

 Â  

 Â  //here all threads(warps) do some work in every 'for' iteration - no branching

 Â  //here so all warps are executing in parallel - right?

 Â  //but Im worried about that big loop and solo execution of thread 0 

 Â  //at the beginning of every iteration :/ - does it serialize execution?

  //DO WORK

  syncthreads(); //do i need this barrier here?

} //end of big loop

//we're out loop

 //every thread writes some data from shared back to global 

 //and kernel finishes

}

Do i get kernel flow right? - will it work like i wrote in comments above?

thank you!

edit: I mean just the kernel structure - serialization of execution, parallelism. Im not sure if i get it right.

Yep, you’ve got everything correct.

The threads of the first warp will diverge shortly at the beginning of the loop, but then converge so everything is parallel in the guts of the loop.

Here is another way to think of __syncthreads(); It is a barrier to use whenever a thread touches shared memory that will be later read by a different thread. In loops, you need to have __syncthreads() both before and after the memory is touched so that one thread doesn’t get too far ahead and write a value that other threads are still reading, so your __synthreads() at the end of the loop is needed. I usually write it like this so the memory that the syncthreads() is protecting is made more obvious than stuffing a synchtreads at the end of the loop:

__synthreads()

if (threadIdx meats condition)

     update shared memory

__synthreads()

oh ok then - thx!

I’ve modified my kernel a little - found that in at-the-beginning-of-the-loop initialization phase I can divide my operations into two parallel parts. Please take a look once more, then if its ok, I ll be assured I finally understand this right.

Please take a look:

__global__ void kernel(...){

if(tid < 32){

 //some work done in parallel by threads from first warp - rest is waiting.

 //on first barrier inside loop

}

// no syncthreads() here! I moved it down into loop to avoid it at the

// end of loop (according to what you wrote in last post) - hope its ok, or do I need 

// it here also?

for(int i = 0; i<64; ++i) { //big loop begins

 syncthreads();

  if(tid == 0){

    //some work done by thread 0 (warp 0)

  }

  if(tid == 1) //OR SHOULD IT BE tid==32 in order to paralellize this initialization??

    //some work done by thread 1 (warp 0)  //thread 32 (warp 1)

  }

  syncthreads();

 //some work in parallel - all threads/warps

  //no syncthreads() here! - instead at the beginning of the loop

}

syncthreads(); //I dont know if i need it here, since I have one at loop begin

//all threads write to global

} //kernel finish

Is this code structure optimal for this algorithm kind ? :

  1. Initialize at kernel start - parallelizable for 32 threads

//LOOP START

  1. Initialize at loop start - parallelizable for 2 threads

  2. Parallel work

//LOOP END

  1. Kernel “de-briefing” - parallelizable for all threads.

And two quick ones:

First:

What is the right way to pass array (in shared memory) to device funtion:

__device__ void devFun(float2 * arr){

  //...

}

__global__ void kernel(....){

  __shared__ float2 s_array[76];

  //...

  devFun(&s_array);

}

devFun performs some work on passed array. But compiler protest when I try to do above.

Second:

I have float array of 3136 in shared memory - s_out. I need to write it back to global memory - g_out. I have 448 threads in block (14 warps).

So every thread has to write 7 values into global (448*7=3136).

3136 divides by 16, by 32, by 7, by 14, by 448 - so I can do the coalesced write, right?

Does this performs the trick? :

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

g_out[addr] = s_out[addr];  //0 ... 447

g_out[addr+448] = s_out[addr+448];  //448 ...895

g_out[addr+896] = s_out[addr+896];  //896...1343

g_out[addr+1344] = s_out[addr+1344];  //1344...1791

g_out[addr+1792] = s_out[addr+1792];  //1792...2239

g_out[addr+2240] = s_out[addr+2240];  //2240...2687

g_out[addr+2688] = s_out[addr+2688];  //2688...3135

or the same in loop:

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset=0;offset<3136;offset+=448){

  g_out[addr+offset] = s_out[addr+offset];

}

Will above 7 writes be coalesced? - this one is really hard for me :/

Maybe Im asking for too much, but if you could provide answer to above questions it will be of great help to me - I need someone to assure me that i got everything right, or prove im wrong. So thank you very very much!

you should use

devFun(s_array);

or

devFun(&s_array[0]);

in your case you are passing the address of the pointer, not the pointer itself

Vrah

what a dumb mistake :argh: thanks!

so two things left - code structure and coalescing , anyone? B)

Though I’m not 100% sure, I think this should work if you take the threadIdx.x as starting offset and your g_out is aligned. Then the writes in each warp are following each other so they should be coalesced.

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){

 Â g_out[addr+offset] = s_out[addr+offset];

}

Edit:

You can also write your code more flexible to the block and grid dimensions:

//at the end of kernel

int addr = blockIdx.x * 7 * blockDim.x; //block starting address

for(unsigned int offset = threadIdx.x; offset < 7 * blockDim.x; offset += blockDim.x){

  g_out[addr+offset] = s_out[addr+offset];

}

Only thing I’m not sure about is if the starting address meets the coalescing requirements. (blockIdx.x * 7 * 14 * 32)

there is an error in my code, it of course should be:

//at the end of kernel

int addr = blockIdx.x*3136+tid; //block starting address

for(unsigned int offset=0;offset<3136;offset+=448){

 g_out[addr+offset] = s_out[addr+offset];

}

Ive added tid to addr.

so its the same to what you, thanks for corretion, wrote:

//at the end of kernel

int addr = blockIdx.x*3136; //block starting address

for(unsigned int offset = threadIdx.x; offset < 3136; offset += 448){

 g_out[addr+offset] = s_out[addr+offset];

}

and final question remain unanswered - is this coalesced?

you state that writing to global like above should be ok, and only thing uncertain is block starting address (bid71432 = bid3136) - so… when starting address is ok? :huh:

According to Programming Guide:

so … type is float so 16sizeof(float) is 64. Starting address for each block is divisible by 64 : bid3136/64 = bid * 49. Starting offset for all 7 writes is also divisible by 64:

448/64 = 7

896/64 = 14

1344/64 = 21

1792/64 = 28

2240/64 = 35

2688/64 = 42

so long everything is divisible by 64, but…

ok lets than take first write (offset 0) in block 1

block starting address is then 1*3136 = 3136

lets take second warp in this block, its first halfwarp - the first thread of this halfwarp is thread 32 -> so the halfwarp starting address is 3136+32=3168

so: HalfWarpBaseAddress-BaseAddress = 3168-3136=(of course) 32.

32/16*sizeof(float)=32/64=0.5 - so it isnt divisible… but sticking to my computations coalesced access never occurs so i think i dont get it :(

anyone? :(

I think it will be coalesced, since you’re writing float values, which is 4 bytes. (3136 + 4 16halfWarpNum) is always going to be divisible by 64, so you’re all set :)

You can always run your code with the visual profiler and actually measure if it is coalesced or not.

man you made my day - simplest solutions are often best one - my fault i didnt know about profiler before :">

checked and yes, things are coalesced!

Keep in mind that the profiler will claim things are always coalesced on GT200, even when they’re clearly not…