local / global work (group) sizes and memory limit calculations: how to find out how much private memory a kernel uses

Hey.
I am trying to feed raster data in the form of textures to my GPU and use it as input for my OpenCL calculations.
I want to fit as much data into the GPU's memory as possible. Since the whole dataset is very large, I cut it into tiles,
which I upload one by one to the GPU and process one after another.
My tiles have a certain dimension, where every pixel represents a work-item.
My tiles also have a border of buffer pixels around them; those pixels don't count as work-items.
And I have some private arrays I need for the per-pixel/work-item calculations.

In order to determine the optimum tile size I can upload to the GPU, I check the maximum texture size the GPU supports and the amount of global memory it has.
But to really get my optimum tile size I need to know how many threads will execute in parallel, and thus how many private arrays I need to reserve memory for.

I assume it is something like max_running_threads * private_array_bytes,
but how do I determine the maximum number of work-items that will actually be processed in parallel?

Just using globalWorkSize * private_array_bytes works, but it obviously wastes most of the memory: only a small portion of the work-items run in parallel at any time, and only those need that memory for their arrays.
That means I waste memory I could be using to fit more actual texture data.
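To put hypothetical numbers on that waste (tile size, array size, and thread count below are all made up for illustration):

```c
#include <stddef.h>

/* All numbers are hypothetical, just to illustrate the gap. */
static void memory_waste_example(void)
{
    size_t tile_w = 2048, tile_h = 2048;   /* one work-item per pixel */
    size_t private_array_bytes = 256;      /* scratch per work-item   */

    /* Sized by globalWorkSize: 2048 * 2048 * 256 B = 1 GiB reserved. */
    size_t by_global_size = tile_w * tile_h * private_array_bytes;

    /* Sized by threads actually in flight (say 20480): only 5 MiB. */
    size_t threads_in_flight = 20480;
    size_t by_resident_threads = threads_in_flight * private_array_bytes;
}
```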

I thought the local work size and the number of cores might be related. But since I can set the local work size myself, it can't be as simple as "the number of items per local group equals the number of threads running in parallel".

Someone help me out here and give me some pointers …

My own answer:
When trying to stream-compute massive data that fits into neither RAM nor GPU memory, one has to split the data in a way that (see the query sketch after this list):

Each chunk fits into RAM.
Each single buffer-object used has to fit into CL_DEVICE_MAX_MEM_ALLOC_SIZE.
The sum of the sizes of all buffer objects + ((the private bytes your kernel uses) * localWorkSize) has to fit into CL_DEVICE_GLOBAL_MEM_SIZE.
The localWorkSize can have at most CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS dimensions, each no larger than the corresponding entry of CL_DEVICE_MAX_WORK_ITEM_SIZES.
The total localWorkSize (the product of its dimensions) must not exceed min(CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_KERNEL_WORK_GROUP_SIZE).
The total localWorkSize should be a multiple of the device's warp/wavefront size (64 for me).
The globalWorkSize has to be a multiple of the localWorkSize; in my solution every global dimension is a multiple of the corresponding local dimension, and the factors may differ per dimension.
The globalWorkSize has to be smaller than pow(2, CL_DEVICE_ADDRESS_BITS), both as a whole and in each dimension.
And last but not least, if you use image buffers, make sure they obey CL_DEVICE_IMAGE2D_MAX_WIDTH x CL_DEVICE_IMAGE2D_MAX_HEIGHT.
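For reference, a minimal sketch of how those limits can be queried, assuming you already have a valid cl_device_id and a built cl_kernel (error checking omitted):

```c
#include <stdlib.h>
#include <CL/cl.h>

static void query_limits(cl_device_id device, cl_kernel kernel)
{
    cl_ulong max_alloc, global_mem;
    size_t   max_wg, img_w, img_h, kernel_wg;
    cl_uint  max_dims, addr_bits;

    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,       sizeof(max_alloc),  &max_alloc,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,          sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,      sizeof(max_wg),     &max_wg,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(max_dims),   &max_dims,   NULL);
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS,             sizeof(addr_bits),  &addr_bits,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH,        sizeof(img_w),      &img_w,      NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT,       sizeof(img_h),      &img_h,      NULL);

    /* Per-dimension work-item limits. */
    size_t *item_sizes = malloc(max_dims * sizeof(size_t));
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    max_dims * sizeof(size_t), item_sizes, NULL);

    /* Kernel-dependent limit; re-query whenever the kernel changes. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);

    /* The cap for the total local work size. */
    size_t max_local_total = (max_wg < kernel_wg) ? max_wg : kernel_wg;

    free(item_sizes);
}
```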

That means the size and number of data chunks you can push to your GPU depend first on RAM, then on the GPU type, then on GPU memory, and then on the kernel you use.

You will notice that your globalWorkSize ends up bigger than the number of items you actually wanted to process, even if you try to match the number of items you really have. Just skip the superfluous items in your kernels (see the sketch below); believe me, it's still way faster than using some global/local work size that fits your data but not your device.
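A sketch of what I mean: round the global size up on the host, and guard against the padding in the kernel. The helper name and the real_w/real_h kernel arguments are hypothetical:

```c
/* Host side: round the global size up to a multiple of the local size. */
size_t round_up(size_t x, size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

/* Kernel side (OpenCL C): real_w/real_h are hypothetical arguments
   holding the true tile dimensions. */
__kernel void process(__read_only image2d_t tile, int real_w, int real_h)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= real_w || y >= real_h)
        return;   /* superfluous padding item: do nothing */
    /* ... per-pixel work ... */
}
```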

My approach was basically to calculate all the needed dimensions and sizes at the Host > App > Context > Device > Kernel levels, recalculating per level where needed, i.e. recalculating kernel-dependent values when the kernel changes, and so on.

OpenCL 1.1 helps out here by providing a way to query the amount of private memory a kernel uses, without adding it up manually.
Right now I actually have to read through the kernel sources and sum up all the variables I use to get accurate values.
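That query is clGetKernelWorkGroupInfo with CL_KERNEL_PRIVATE_MEM_SIZE; a minimal sketch, assuming a built kernel and its device:

```c
/* OpenCL 1.1: private memory per work-item, as reported by the compiler. */
cl_ulong private_bytes = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);
```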

However, that way I was finally able to use something like 99.9% of my GPU's memory without seemingly random CL_OUT_OF_YO_MAMMA. (Make sure you use the notify callback you can pass at context creation to really get all of those errors in time.)
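For completeness, a minimal sketch of such a notify callback; the callback name and what you do with the message are of course up to you:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Invoked asynchronously by the implementation on context errors. */
static void CL_CALLBACK context_notify(const char *errinfo,
                                       const void *private_info,
                                       size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL context error: %s\n", errinfo);
}

/* ... inside your setup code: */
cl_int err;
cl_context ctx = clCreateContext(NULL, 1, &device,
                                 context_notify, NULL, &err);
```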

Please feel free to correct/expand.

How can you do this? When OpenCL compiles the kernel, it optimizes variables away. So you would have to look at the binary and, I guess, keep track of the number of registers used? I'm not exactly sure how to do that.

When I try -cl-opt-disable as a flag for clBuildProgram I get CL_INVALID_BINARY.

You are right - it's by no means a good way. I'd prefer to use CL_KERNEL_PRIVATE_MEM_SIZE to get the real amount, but it doesn't work for me (it never returns usable values). However, the compiler optimizing variables away doesn't really matter: with the maximum value I counted I'm still on the safe side, even if the real value is lower.