I have a question that came up while implementing my algorithm, and I'd appreciate any suggestions. In my program, each thread may generate a variable number of updates, say 0 to 16, and all updates need to be stored in global memory in one contiguous block. My first idea is to have each thread pre-calculate how many updates it needs, run a scan (prefix sum) over those counts to generate the offsets, and then have each thread read its offset and write its updates to global memory. However, in this case the writes won't be coalesced (unless every thread writes exactly 1 update), because consecutive threads no longer write to consecutive addresses.
For example, thread0 generates 3 updates, thread1 generates 1 update, thread2 generates 2 updates… and I would like to arrange all updates like this:
data[0] = t0_0,
data[1] = t0_1,
data[2] = t0_2,
data[3] = t1_0,
data[4] = t2_0,
data[5] = t2_1…
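To make the scan-based idea concrete, here is a CPU-side sketch in C++ of the indexing only (it is not kernel code, and the outer loop over t stands in for the threads). The function names and the encoding of update t_k as t * 100 + k are made up for illustration.

```cpp
#include <vector>

// Exclusive prefix sum over the per-thread update counts:
// offsets[t] is the index in the output where thread t starts writing.
std::vector<int> exclusive_scan_counts(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (int t = 0; t < (int)counts.size(); ++t) {
        offsets[t] = running;
        running += counts[t];
    }
    return offsets;
}

// Scatter step: "thread" t writes its k-th update to out[offsets[t] + k].
// The payload t * 100 + k is a hypothetical stand-in for the real update t_k.
std::vector<int> scatter_updates(const std::vector<int>& counts) {
    const std::vector<int> offsets = exclusive_scan_counts(counts);
    int total = 0;
    for (int c : counts) total += c;

    std::vector<int> out(total);
    for (int t = 0; t < (int)counts.size(); ++t) {
        for (int k = 0; k < counts[t]; ++k) {
            out[offsets[t] + k] = t * 100 + k;
        }
    }
    return out;
}
```

With counts {3, 1, 2} the offsets are {0, 3, 4}, so for k = 0 the three threads write to indices 0, 3 and 4. Those addresses are not adjacent, which is exactly the uncoalesced pattern I am worried about.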
Is it possible to get all writes coalesced? Or should I consider not storing them contiguously?
EDIT: I just thought of another approach: first let each thread write directly to an address chosen so that the writes are coalesced, and then compact the array in a later pass (a scan-based compaction rather than a plain reduction, I think). However, every thread then has to reserve space for the maximum of 16 updates, so this needs much more memory when most threads produce only a few updates…
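A rough CPU-side sketch in C++ of this second approach, again only to show the indexing. It assumes a padded layout where slot (k, t) lives at k * numThreads + t, so that for a fixed k consecutive threads hit consecutive addresses. The names, the -1 sentinel for unused slots and the t * 100 + k payload are all assumptions for illustration.

```cpp
#include <vector>

const int kMaxUpdates = 16;  // upper bound on updates per thread (from the post)
const int kEmpty = -1;       // assumed sentinel marking an unused slot

// Step 1: every "thread" t writes its k-th update to padded[k * n + t].
// For a fixed k, threads 0..n-1 touch adjacent addresses.
std::vector<int> write_padded(const std::vector<int>& counts) {
    const int n = (int)counts.size();
    std::vector<int> padded(kMaxUpdates * n, kEmpty);
    for (int k = 0; k < kMaxUpdates; ++k) {
        for (int t = 0; t < n; ++t) {
            if (k < counts[t]) {
                padded[k * n + t] = t * 100 + k;  // hypothetical payload for t_k
            }
        }
    }
    return padded;
}

// Step 2: compaction, dropping the unused slots while keeping their order.
// On the GPU this would be a scan over validity flags followed by a scatter.
std::vector<int> compact_padded(const std::vector<int>& padded) {
    std::vector<int> out;
    for (int v : padded) {
        if (v != kEmpty) out.push_back(v);
    }
    return out;
}
```

Two things stand out in this sketch: the padded buffer always holds 16 * numThreads slots regardless of how many updates exist, and with this particular layout the compacted result is ordered by k first (t0_0, t1_0, t2_0, t0_1, …) rather than grouped per thread as in my example above.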
Thanks for any suggestion!