I am designing an app which will have the following (big) performance hit:
each work-item needs, in one specific function, an array of 64 ints dedicated to itself to do its operation.
I already considered a lot of redesigns, but those cannot be implemented:
using local memory is not an option: I have 2048 bytes of local memory for 96 work-items, while 96 arrays of 64 ints would need 24,576 bytes.
using textures is also not an option: I need read/write access, and images in OpenCL only allow read OR write.
so I guess I’m stuck with private memory :(
there’s one thing that might help: memory coalescing.
usually, each work-item will address the array like storage[index], where index is a local memory variable.
it might be helpful to redesign the array in “virtual” local memory (I mean: it is put in global memory, but has the scope of local memory) as such:
storage[index][threadid]
where threadid is a number from 0 to 95.
This will always be a coalesced memory operation.
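To make the layout concrete, here is a minimal sketch in plain host-side C (not kernel code) of how storage[index][threadid] flattens to a linear offset. The function names interleaved_offset and per_thread_offset are made up for illustration; the sizes 64 and 96 come from the post. It only shows that neighbouring work-items land on neighbouring addresses when they use the same index, which is the precondition for coalescing.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN  64  /* ints per work-item, from the post */
#define GROUP_SIZE 96  /* work-items per work-group, from the post */

/* Flat offset of storage[index][threadid] inside one work-group's block.
 * Hypothetical helper: for a fixed index, consecutive threadids are adjacent. */
size_t interleaved_offset(size_t index, size_t threadid)
{
    assert(index < ARRAY_LEN);
    assert(threadid < GROUP_SIZE);
    return index * GROUP_SIZE + threadid;
}

/* For comparison: flat offset of storage[threadid][index].
 * For a fixed index, consecutive threadids are ARRAY_LEN elements apart. */
size_t per_thread_offset(size_t index, size_t threadid)
{
    assert(index < ARRAY_LEN);
    assert(threadid < GROUP_SIZE);
    return threadid * ARRAY_LEN + index;
}
```

With the same index, work-items 10 and 11 are one int apart in the interleaved layout and 64 ints apart in the per-thread layout. If each work-item uses a different index, neither layout gives adjacent addresses.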
The problem I’m stuck with is the “virtual” idea:
there will be 10,000 work-groups in the task. it’s not possible to do something like storage[workgroupid][index][threadid]: this requires about 240 MB of video memory, and I’m not sure every device can allocate that much just for this storage space.
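The size estimate can be checked with a few lines of plain C. This is only a sketch of the arithmetic, assuming a 4-byte int (as in OpenCL C); storage_bytes is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for storage[workgroupid][index][threadid].
 * Hypothetical helper; elem_size is 4 for an OpenCL int. */
uint64_t storage_bytes(uint64_t work_groups, uint64_t group_size,
                       uint64_t array_len, uint64_t elem_size)
{
    assert(elem_size > 0);  /* a zero element size would hide a caller mistake */
    return work_groups * group_size * array_len * elem_size;
}
```

For 10,000 work-groups of 96 work-items with 64 ints each, this gives 245,760,000 bytes (about 234 MiB). A single work-group needs 24,576 bytes, which is also why the 2048 bytes of local memory are not enough.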
so I need to “allocate” storage at the beginning of a workgroup, and “free” it at the end.
but if I did this, I would need int*** pointers, and I thought pointers to pointers are not allowed in OpenCL.
It seems to me that private memory is the best thing you can do here. Those accesses may actually be coalesced if the access pattern is the same for all work-items. I remember that accesses to private memory were guaranteed to be coalesced in CUDA (where it’s called local memory, for maximum confusion), though I don’t know how that would work in hardware…
Placing this array in global memory wouldn’t be any better than in private, probably.
So, just put “int storage[64];” in kernel code and it will be statically allocated in private memory (which aliases cleverly to global memory). If accesses indeed become coalesced and if you have plenty of computation to hide latency, you might actually find this working reasonably fast.
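As a rough illustration of what "a statically allocated array per work-item" looks like, here is a plain C emulation of one work-item's body. It is not kernel code: in an OpenCL C kernel the same unqualified declaration would live in private memory, while here it is an ordinary stack array. The workload (counting values modulo 64) and the name max_bucket_count are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define STORAGE_LEN 64  /* array size from the post */

/* Emulates one work-item: every call gets its own scratch array,
 * just as every work-item gets its own private copy in a kernel.
 * Hypothetical workload: count values modulo 64, return the fullest bucket. */
int max_bucket_count(const unsigned *values, size_t n)
{
    int storage[STORAGE_LEN] = {0};  /* per-invocation scratch space */
    int best = 0;

    assert(values != NULL || n == 0);

    /* Data-dependent indexing, like storage[index] in the post. */
    for (size_t i = 0; i < n; ++i) {
        size_t index = values[i] % STORAGE_LEN;
        storage[index] += 1;
    }

    /* Scan the scratch array for the largest count. */
    for (size_t k = 0; k < STORAGE_LEN; ++k) {
        if (storage[k] > best)
            best = storage[k];
    }
    return best;
}
```

The point of the sketch is only that no allocation or freeing is needed: the array exists for the duration of the call and is indexed freely inside it.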
Dynamic allocation from within a kernel is indeed impossible in OpenCL.
I was wondering, how exactly do you declare something for use as private memory? I know this is an amateur question, but I just haven’t seen any clear documentation on it. A simple example would be great. Thanks!
All variables you declare in a kernel or a function without an explicit address space qualifier end up as private. You can also do this explicitly by specifying
__private int a;
__private float b[200];
You can also use “private” (without the underscores).