I am designing an app which will have the following (big) performance hit:
each work-item needs, in one specific function, an array of 64 ints dedicated to itself to do its operation.
I already considered a lot of redesigns, but those cannot be implemented:
- using local memory is not an option: I have 2048 bytes of local memory for 96 work-items, while 96 arrays of 64 four-byte ints would need 24,576 bytes.
- using textures is also not an option: I need read/write access, and images in OpenCL only allow read OR write.
so I guess I’m stuck with private memory :(
there’s one thing that might be helping: memory coalescing:
usually, each work-item will address the array like storage[index], where index is a local memory variable.
it might be helpful to redesign the array in “virtual” local memory (I mean: it is put in global memory, but has the scope of local memory) as such:
storage[index][threadid]
where threadid is a number between 0 and 95.
This will always be a coalesced memory operation.
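To make the layout concrete, here is a minimal sketch in plain C of the offset arithmetic, assuming the layout meant above is storage[index][threadid] flattened into one buffer of 64 * 96 ints. The helper names interleaved_offset and contiguous_offset are made up for illustration; only the sizes 64 and 96 come from the post.

```c
#include <stddef.h>

/* Sizes taken from the post: 96 work-items per group, 64 ints each. */
enum { WORK_ITEMS = 96, INTS_PER_ITEM = 64 };

/* Interleaved layout (hypothetical helper): element `index` of
   work-item `threadid` in a flattened storage[index][threadid]. */
static size_t interleaved_offset(size_t index, size_t threadid)
{
    return index * WORK_ITEMS + threadid;
}

/* For comparison (hypothetical helper): each work-item owns one
   contiguous run of 64 ints, i.e. storage[threadid][index]. */
static size_t contiguous_offset(size_t index, size_t threadid)
{
    return threadid * INTS_PER_ITEM + index;
}
```

With the interleaved form, neighbouring work-items that use the same index touch neighbouring ints (stride 1), whereas the contiguous form puts them 64 ints apart. If the work-items use different index values at the same moment, the accesses are no longer adjacent, so the coalescing claim presumably depends on that.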
The problem I’m stuck with is the “virtual” idea:
there will be 10,000 work-groups in the task. it’s not possible to do something like storage[workgroupid][index][threadid]: this requires about 240 MB of video memory (10,000 x 64 x 96 four-byte ints), and I’m not sure every device can allocate that much just for this storage space.
so I need to “allocate” storage at the beginning of a work-group, and “free” it at the end.
but if I do this, I need int*** pointers, and I thought pointers to pointers are not allowed in OpenCL.
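As a rough check of that figure, and to show the three-dimensional indexing written against a single flat buffer, here is a small sketch in plain C. It assumes 4-byte ints (as in OpenCL C); the names full_storage_bytes and flat_offset are hypothetical and only the sizes come from the post.

```c
#include <stddef.h>

/* Sizes taken from the post. */
enum { NUM_GROUPS = 10000, GROUP_SIZE = 96, ARRAY_LEN = 64 };

/* Bytes needed for the full storage[workgroupid][index][threadid]
   array, assuming a 4-byte int. */
static size_t full_storage_bytes(void)
{
    return (size_t)NUM_GROUPS * ARRAY_LEN * GROUP_SIZE * 4u;
}

/* Offset of storage[workgroupid][index][threadid] inside one flat
   buffer of ints; a single-level pointer plus this offset is enough. */
static size_t flat_offset(size_t workgroupid, size_t index, size_t threadid)
{
    return (workgroupid * ARRAY_LEN + index) * GROUP_SIZE + threadid;
}
```

full_storage_bytes() comes to 245,760,000 bytes, which matches the rough 240M estimate. The same offset arithmetic could be applied to one plain pointer to global memory inside a kernel, so the indexing by itself does not need int***; it does nothing to reduce the memory footprint, though.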
can anybody give me a hint?