More on work-list type algorithms and Persistent threads


I am trying to optimize a work-list-sharing algorithm on the GPU, where one set of data, after some computation, generates either two or zero new sets of data to add to the work-list.

I am trying to use persistent threads to design this system.
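To make the question concrete, here is a minimal sketch of the kind of persistent-thread loop I have in mind. Everything here is a placeholder/assumption on my part (the `WorkItem` type, the `process()` function, the fixed-capacity global queue, and the naive termination check), not working code:

```cuda
// Sketch only: persistent threads consuming from a global work queue.
// WorkItem, process(), and the queue layout are hypothetical placeholders.
struct WorkItem { int payload; };

__device__ int g_head;                  // index of next item to consume
__device__ int g_tail;                  // index of next free output slot
__device__ WorkItem g_queue[1 << 20];   // fixed-capacity queue (assumption)

// Returns true if the item spawned two children, false if it spawned zero.
__device__ bool process(const WorkItem &in, WorkItem out[2]);

__global__ void persistentKernel()
{
    while (true) {
        int idx = atomicAdd(&g_head, 1);   // each thread grabs one item
        if (idx >= g_tail)                 // NOTE: racy termination check --
            break;                         // real code must account for items
                                           // still being produced by others
        WorkItem children[2];
        if (process(g_queue[idx], children)) {
            int slot = atomicAdd(&g_tail, 2);  // reserve two output slots
            g_queue[slot]     = children[0];
            g_queue[slot + 1] = children[1];
        }
    }
}
```

The kernel would be launched with just enough blocks to fill the device (roughly number of SMs times resident blocks per SM), rather than one thread per item. The termination condition above is the part I am least sure about, since a thread can see an empty queue while another thread is still producing.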

I have a couple of doubts regarding this:

  • I have seen it written that the block size of a persistent-thread system should equal the warp size (32). Is this necessary, and why? (I see that it makes work fetching easier to synchronize, but are there other reasons?)
  • What is a smart way of doing warp-wise synchronization? Will building software barriers out of simple while loops (spin-waiting) cause performance slowdowns or warp divergence?
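Regarding the second point: rather than spin-waiting, one pattern I have seen is to let a single lane do the atomic fetch and broadcast the result with warp-shuffle intrinsics, so the whole warp stays convergent. A hedged sketch (assuming a `g_counter` work counter and compute capability 3.0+ for `__shfl_sync`; whether this is the "smart" way is exactly my question):

```cuda
// Sketch: one atomic per warp instead of one per thread.
// g_counter is a hypothetical global work counter.
__device__ int g_counter;

__device__ int warpFetch(int lane)
{
    int base;
    if (lane == 0)
        base = atomicAdd(&g_counter, 32);      // lane 0 claims 32 items
    base = __shfl_sync(0xffffffff, base, 0);   // broadcast base to all lanes
    return base + lane;                        // each lane gets its own index
}
```

My understanding is that since all 32 lanes of a warp execute in lockstep (at least pre-Volta), a warp never needs an explicit barrier with itself, and a software `while`-barrier across lanes can actually deadlock or serialize under divergence; but I would like confirmation.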

I will post more issues as I run into them.
Thanks, all.

Gentle bounce :)