numbers of write to global memory for each thread

Mu-Chi_Sung · March 31, 2008, 4:06am

Hi.

I have a question here while trying to implement my algorithm, hope to hear some suggestion. Basically, in my program, each thread may generate “numbers” of updates, which could be, said, 0~16, and all updates need to be stored in global memory (one continuous memory block). My simple thought is to have each thread pre-calculate number of updates needed and make a scan to generate the offsets, and finally each thread read the offset and write those updates to global memory. However, in this case, those writes won’t be coalesed (unless each thread write exactly 1 update).

For example, thread0 generates 3 updates, thread1 generates 1 updates, thread2 generates 2 updates…and I would like to arrange all updates like this:

data[0] = t0_0,
data[1] = t0_1,
data[2] = t0_2,
data[3] = t1_0,
data[4] = t2_0,
data[5] = t2_1…

Is it possible to get all writes coalesed? or should I consider not to store them in a continuous way?

EDIT: just thought about another approach, first let each thread dirctly write to a proper address that makes it coalesed, and use reduction to compact the array later. However this requires much more memory if the numbers of updates for each thread are few…

Thanks for any suggestion!

JHHPC · March 31, 2008, 6:07am

Hi,

if I understood it right you could put the array in shared memory and let the threads put the calculated values there. After that each thread writes out one entry of the shared memory where thread 0 writes data[0] and so on.

This should be coalesed. However, you might get some shared memory bank conflicts, but still better performance.

Hope that helps,

Johannes

Mu-Chi_Sung · March 31, 2008, 7:42am

Thanks, but I didn’t get it. You’re saying that I should write all updates to shared mem first and move them to global mem later…

So…I need to calculate the number of updates for each thread, and do scan within the block to find out the offset in shared mem…but this is just “shared mem offset”. Still I need “global mem offset” to write updates to global mem in a compact/continuous way.

Can I do global scan (which basically involves multiple blocks) without leaving current kernel context? Well, with cuda 1.1, probably I can use atomic op to obtain a global offset for entire block but it’s just toooooo slow…

PINS · March 31, 2008, 7:06pm

Hi, I think I’ll run into this problem soon.

Its basically the problem of filling a vector in parallel. We don’t know in advance how many items each thread will write. And even so, the writes would be far apart and there would be no coalescing.

I thought about doing a first pass to just count the number of elements per thread and then using scan to at least determine where each thread should write. In the second pass they would actually write the output to memory. But I haven’t thought about the coalescing part.

Maybe in this second pass, given the offsets computed in scan, it is possible to write to shared memory and then read from it sequentially (each thread reads the next element) and write them coalesced into global memory.

At first glance I think there is not enough shared memory to do this easily (without limiting occupancy).

Any thoughts? I’m sure other CUDA algorithms have found a similar dilemma. Is there something we could look in the SDK, for example?

Topic		Replies	Views
Memory coalescing in one thread CUDA Programming and Performance	17	16798	March 31, 2011
An example of coalesced memory access CUDA Programming and Performance	2	3711	June 28, 2010
Speeding up memory writes CUDA Programming and Performance	5	3312	July 3, 2008
Global Memory Coalescing: Read and Write Memory Coalescing CUDA Programming and Performance	9	8331	July 31, 2007
Isn't that Coalesced?! writing to global memory in a coalesced way CUDA Programming and Performance	9	10283	June 28, 2009
Shared memory question CUDA Programming and Performance	27	7657	June 23, 2008
coalescing example CUDA Programming and Performance	2	1306	May 11, 2009
Global Memory Coalescing Help CUDA Programming and Performance	1	1361	June 17, 2008
Moving a (BS_X+1)(BS_Y+1) global memory matrix by BS_XBS_Y threads CUDA Programming and Performance	0	585	December 15, 2012
read from global mem vs write to global mem CUDA Programming and Performance	13	6563	January 22, 2009

numbers of write to global memory for each thread

Related topics