Hi,
I am trying to optimize a work-list-sharing type algorithm on the GPU, where one set of data, after some computation, generates either two or zero new sets of data in the work-list to work on.
I am trying to use persistent threads to design this system.
I have a couple of doubts regarding this:
- I saw it written somewhere that the block size of a persistent-threads system should be equal to the warp size (32). Is this necessary, and why? (I see that it makes synchronizing work fetching easier, but what other reasons are there?)
- What is a smart way of doing warp-wise synchronization? Will building software barriers out of simple `while` spin loops cause performance slowdowns or warp divergence?
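To make the question concrete, here is a rough sketch of the kind of loop I have in mind: a persistent-threads kernel over a global work queue, where warp lane 0 fetches one item per warp with `atomicAdd` and broadcasts it via `__shfl_sync`, so no software spin-barrier is needed inside the warp. All the names (`g_head`, `g_tail`, `WorkItem`, the expansion condition) are placeholders of my own, not from any library, and termination detection is deliberately simplified:

```cuda
#include <cuda_runtime.h>

struct WorkItem { int payload; };

__device__ int g_head;   // next queue index to consume
__device__ int g_tail;   // next free queue slot to produce into

__global__ void persistentKernel(WorkItem* queue, int capacity)
{
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x % 32;

    while (true) {
        int idx = -1;
        if (lane == 0)
            idx = atomicAdd(&g_head, 1);       // one fetch per warp
        idx = __shfl_sync(FULL_MASK, idx, 0);  // broadcast index to whole warp

        // NOTE: simplified drain check; a real version needs a proper
        // "no producers still running" test or it may exit too early.
        if (idx >= atomicAdd(&g_tail, 0))      // atomic read of current tail
            break;

        WorkItem item = queue[idx];
        // ... compute on `item`; it expands into two children or into nothing ...
        bool expands = (item.payload > 1);     // placeholder condition
        if (expands && lane == 0) {
            int slot = atomicAdd(&g_tail, 2);  // reserve two slots at once
            queue[slot]     = WorkItem{ item.payload / 2 };
            queue[slot + 1] = WorkItem{ item.payload - item.payload / 2 };
        }
        __syncwarp(FULL_MASK);                 // warp-wide sync point
    }
}
```

The idea is that with one queue fetch per warp, synchronization stays at warp granularity (`__syncwarp` / `__shfl_sync`), which is part of why I'm asking whether block size = warp size is actually required or just convenient.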
I will be posting more issues as I run into them.
thanks all
Sid.