number of writes to global memory for each thread


I have a question while trying to implement my algorithm; I hope to hear some suggestions. Basically, in my program each thread may generate a variable number of updates, say 0–16, and all updates need to be stored in global memory (in one contiguous memory block). My simple thought is to have each thread pre-calculate the number of updates it needs, run a scan to generate the offsets, and finally have each thread read its offset and write its updates to global memory. However, in this case those writes won’t be coalesced (unless each thread writes exactly 1 update).

For example, thread0 generates 3 updates, thread1 generates 1 update, thread2 generates 2 updates… and I would like to arrange all updates like this:

data[0] = t0_0,
data[1] = t0_1,
data[2] = t0_2,
data[3] = t1_0,
data[4] = t2_0,
data[5] = t2_1…
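The count/scan/scatter idea could look like the following kernel sketch (all names are hypothetical placeholders; `offsets` is assumed to hold the exclusive scan of the per-thread counts, with one sentinel entry at the end):

```cuda
// Hypothetical application-specific logic: the i-th update of thread tid.
__device__ int computeUpdate(int tid, int i);

// Scatter pass: each thread writes its updates at its scanned offset.
// offsets[] is the exclusive scan of the per-thread counts (length nThreads+1).
__global__ void scatterUpdates(const int *offsets, int *data)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = offsets[tid];             // start of this thread's output range
    int n    = offsets[tid + 1] - base;  // this thread's update count (0..16)
    for (int i = 0; i < n; ++i)
        data[base + i] = computeUpdate(tid, i);
    // Neighbouring threads write to slots that can be up to 16 elements
    // apart, so these stores are not coalesced.
}
```

This produces exactly the compact layout shown above, but makes the coalescing problem visible in the inner loop.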

Is it possible to get all writes coalesced? Or should I consider not storing them contiguously?

EDIT: I just thought of another approach: first let each thread write directly to an address that makes the writes coalesced, then use a reduction to compact the array in a later pass. However, this requires much more memory if the number of updates per thread is small…

Thanks for any suggestion!


If I understood it right, you could put the array in shared memory and let the threads put their calculated values there. After that, each thread writes out one entry of shared memory, with thread 0 writing data[0] and so on.

This should be coalesced. You might get some shared-memory bank conflicts, but performance should still be better.
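One way to read this suggestion as code (a sketch with hypothetical names: `computeUpdate` and `countUpdates` stand for the per-application logic, `localOffset` for an intra-block exclusive scan of the counts, and `blockBase` for each block's starting position in the output, with a sentinel entry at the end):

```cuda
__device__ int computeUpdate(int tid, int i);  // hypothetical per-app logic
__device__ int countUpdates(int tid);          // hypothetical count (0..16)

// Stage the block's updates in shared memory at scanned local offsets,
// then flush them so consecutive threads hit consecutive global addresses.
__global__ void stagedScatter(const int *blockBase,   // per-block global base
                                                      // (length gridDim.x+1)
                              const int *localOffset, // intra-block exclusive
                                                      // scan, one per thread
                              int *data)
{
    extern __shared__ int buf[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int off = localOffset[tid];
    int n   = countUpdates(tid);
    for (int i = 0; i < n; ++i)
        buf[off + i] = computeUpdate(tid, i);  // possible bank conflicts,
                                               // but shared memory is cheap
    __syncthreads();

    // Coalesced flush: thread k copies elements k, k+blockDim.x, k+2*blockDim.x…
    int base  = blockBase[blockIdx.x];
    int total = blockBase[blockIdx.x + 1] - base;  // updates in this block
    for (int i = threadIdx.x; i < total; i += blockDim.x)
        data[base + i] = buf[i];
}
```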

Hope that helps,


Thanks, but I didn’t get it. You’re saying that I should write all updates to shared memory first and move them to global memory later…

So… I need to calculate the number of updates for each thread and do a scan within the block to find the offset in shared memory, but that is just the “shared memory offset”. I still need a “global memory offset” to write the updates to global memory in a compact/contiguous way.

Can I do a global scan (which basically involves multiple blocks) without leaving the current kernel context? Well, with CUDA 1.1 I could probably use an atomic op to obtain a global offset for the entire block, but that’s just too slow…
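The atomic cost can at least be amortized to one atomic per block rather than one per thread (a sketch, with hypothetical names; `blockTotals[b]` is assumed to hold block b's total update count, e.g. the last element of its intra-block scan):

```cuda
// One atomic per block: thread 0 reserves a contiguous output range for the
// whole block with a single atomicAdd (compute capability 1.1+ required).
__global__ void reserveBlockRange(int *globalCounter,
                                  const int *blockTotals,
                                  int *blockBaseOut)
{
    __shared__ int base;
    if (threadIdx.x == 0)
        base = atomicAdd(globalCounter, blockTotals[blockIdx.x]);
    __syncthreads();
    // base is now visible to every thread in the block; each thread could
    // write its updates at data[base + localOffset + i].
    if (threadIdx.x == 0)
        blockBaseOut[blockIdx.x] = base;
}
```

Note that with this scheme the block order in the output is nondeterministic, so it only works if the updates don't have to appear in thread-id order.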

Hi, I think I’ll run into this problem soon.

It’s basically the problem of filling a vector in parallel. We don’t know in advance how many items each thread will write, and even if we did, the writes would be far apart and there would be no coalescing.

I thought about doing a first pass to just count the number of elements per thread and then using a scan to at least determine where each thread should write. In the second pass they would actually write the output to memory. But I haven’t thought through the coalescing part.

Maybe in this second pass, given the offsets computed by the scan, it is possible to write to shared memory and then read from it sequentially (each thread reads the next element) and write the results coalesced into global memory.

At first glance I think there is not enough shared memory to do this easily (without limiting occupancy).

Any thoughts? I’m sure other CUDA algorithms have faced a similar dilemma. Is there something we could look at in the SDK, for example?