I am trying to feed raster data, in the form of textures, to my GPU and use it as input for my OpenCL calculations.
I want to fit as much data into the GPU's memory as possible. Since the whole dataset is very large, I cut it into tiles,
which I upload to the GPU one by one and process one after another.
Each tile has a certain dimension, where every pixel represents a work-item.
Each tile also has a buffer of border pixels around it; those pixels don't count as work-items.
And I need some private arrays to do the per-pixel/work-item calculations.
To determine the optimum tile size I can upload to the GPU, I check the maximum texture size the GPU supports and the amount of global memory it has.
But to really get my optimum tile size, I need to know how many threads will execute in parallel, and thus how many private arrays I need to reserve memory for.
I assume it is something like max_running_threads * private_array_bytes,
but how do I define or determine the maximum number of work-items that will actually run in parallel?
Just using globalWorkSize * private_array_bytes works, but it obviously wastes most of the memory: only a small portion of the work-items run in parallel at any moment, and
only those actually need the memory for their arrays.
That means I waste memory I could be using to fit more actual texture data.
I thought the local work size and the number of cores might have something to do with it. But if I can set the local work size myself, it can't be as simple as "the number of items per local group equals the number of threads running in parallel".
Can someone help me out here and give me some pointers?