memory coalescing

jpsollie · January 21, 2010, 9:10am

I am designing an app that has one big performance problem:
in one specific function, each work-item needs its own array of 64 ints to do its work.

I have already considered several redesigns, but none of them can be implemented:

Using local memory is not an option: I have 2048 bytes of local memory for 96 work-items, while the arrays would need 96 × 64 × 4 = 24,576 bytes.
Using textures is also not an option: I need read/write access, and images in OpenCL only allow read OR write within a kernel.

so I guess I’m stuck with private memory :(

There’s one thing that might help: memory coalescing.
Usually, each work-item will address the array as storage[index], where index is a variable in local memory.

It might help to lay the array out in “virtual” local memory (I mean: it is put in global memory, but has the scope of local memory) like this:

storage[index][threadid]

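where threadid is a number between 0 and 95.
As long as all work-items use the same index at the same time, neighbouring work-items touch neighbouring addresses, so this will be a coalesced memory operation.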
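
To make the layout concrete, here is a minimal plain-C sketch of how storage[index][threadid] maps to a flat offset, next to the opposite ordering for comparison. It only models the address arithmetic on the host; the function names and the SLOTS and GROUP_SIZE constants are made up for illustration (the values 64 and 96 come from the post).

```c
#include <assert.h>
#include <stddef.h>

enum { SLOTS = 64, GROUP_SIZE = 96 };  /* sizes taken from the post */

/* Offset of storage[index][threadid] in a flat array of
   SLOTS * GROUP_SIZE ints: threadid is the fastest-varying dimension. */
size_t interleaved_offset(size_t index, size_t threadid)
{
    assert(index < SLOTS && threadid < GROUP_SIZE);
    return index * GROUP_SIZE + threadid;
}

/* For comparison: storage[threadid][index], where each work-item owns
   a contiguous block of SLOTS ints. */
size_t blocked_offset(size_t index, size_t threadid)
{
    assert(index < SLOTS && threadid < GROUP_SIZE);
    return threadid * SLOTS + index;
}
```

With the interleaved form, two neighbouring work-items that use the same index are one int apart; with the blocked form they are 64 ints apart, which is the pattern that typically defeats coalescing.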

The problem I’m stuck with is the “virtual” idea:
there will be 10,000 work-groups in the task. It’s not possible to do something like storage[workgroupid][index][threadid]: this requires about 240 MB of video memory (10,000 × 64 × 96 × 4 bytes), and I’m not sure every device can allocate that much just for this storage space.
So I need to “allocate” storage at the beginning of a work-group, and “free” it at the end.
But if I do this, I need int*** pointers, and I thought pointers to pointers are not allowed in OpenCL.
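
As a side note on the numbers and the indexing: the sketch below, in plain C on the host, works out the size of the full storage[workgroupid][index][threadid] array and shows that a three-dimensional layout can be addressed through a single flat pointer with computed offsets, so int*** is not strictly required for the indexing itself. It does not solve the allocation problem. The names scratch_bytes and scratch_offset and the constants are hypothetical, and 4 bytes per int is assumed because OpenCL defines int as 32 bits.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SLOTS_PER_ITEM  = 64,  /* ints per work-item, from the post        */
    ITEMS_PER_GROUP = 96,  /* work-items per work-group, from the post */
    BYTES_PER_INT   = 4    /* OpenCL int is 32 bits                    */
};

/* Bytes needed if every one of num_groups work-groups gets its own region. */
size_t scratch_bytes(size_t num_groups)
{
    return num_groups * SLOTS_PER_ITEM * ITEMS_PER_GROUP * BYTES_PER_INT;
}

/* Flat offset (in ints) of storage[group][index][threadid]. */
size_t scratch_offset(size_t group, size_t index, size_t threadid)
{
    assert(index < SLOTS_PER_ITEM && threadid < ITEMS_PER_GROUP);
    return (group * SLOTS_PER_ITEM + index) * ITEMS_PER_GROUP + threadid;
}
```

For 10,000 work-groups, scratch_bytes gives 245,760,000 bytes, which matches the rough 240M estimate.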

Can anybody give me a hint?

_Big_Mac · January 21, 2010, 11:10am

I haven’t found anything in the specs that would say pointers to pointers aren’t allowed in kernel code.

jpsollie · January 21, 2010, 11:52am

You’re right, sorry.

It seems pointers to pointers are only disallowed as kernel arguments; otherwise they can exist just fine :)

Nevertheless, as far as I know, there is no way to do dynamic memory allocation in OpenCL kernel code, so the problem still stands.

_Big_Mac · January 21, 2010, 3:37pm

It seems to me that private memory is the best you can do here. Those accesses may actually be coalesced if the access pattern is the same for all work-items. I remember that accesses to private memory were guaranteed to be coalesced in CUDA (where it’s called local memory, for maximum confusion), though I don’t know how that would work in hardware…

Placing this array in global memory wouldn’t be any better than in private, probably.

So, just put “int storage[64];” in kernel code and it will be statically allocated in private memory (which aliases cleverly to global memory). If accesses indeed become coalesced and if you have plenty of computation to hide latency, you might actually find this working reasonably fast.
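
A rough sketch of what that looks like, written as plain C so it can be compiled on the host: the function below stands in for the body of a single work-item. In OpenCL C the same declaration inside a __kernel function would end up in private memory. The function name, the seed parameter and the fill pattern are placeholders, not part of the original design.

```c
#include <assert.h>

enum { STORAGE_LEN = 64 };  /* array length from the post */

/* Stand-in for the work done by one work-item. */
int work_item_body(int seed, int index)
{
    int storage[STORAGE_LEN];  /* one array per call, i.e. per work-item */

    assert(index >= 0 && index < STORAGE_LEN);

    /* Fill the scratch array with placeholder data. */
    for (int i = 0; i < STORAGE_LEN; ++i)
        storage[i] = seed + i;

    /* Dynamic indexing, as in storage[index] from the first post. */
    return storage[index];
}
```

The array size is a compile-time constant, so nothing needs to be allocated or freed at run time.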

Dynamic allocation from within a kernel is indeed impossible in OpenCL.

jpsollie · January 21, 2010, 4:43pm

All right, I’ll design it this way. Thanks for your fast answers.

achinda99 · February 2, 2010, 7:58pm

I was wondering, how exactly do you declare something for use as private memory? I know this is an amateur question, but I just haven’t seen any clear documentation on it. A simple example would be great. Thanks!

_Big_Mac · February 2, 2010, 10:19pm

All variables you declare in a kernel or a function without an explicit address space qualifier end up as private. You can also do this explicitly by specifying

__private int a;

__private float b[200];

You can also use “private” (without the underscores).

achinda99 · February 2, 2010, 10:46pm

** EDIT **

I realized my question was a poor one and am retracting it.
