local / global work (group) sizes and memory limit calculations: how to find out how much private memory a kernel uses

Hey.
I am trying to feed raster data in the form of textures to my GPU and use it as input for my OpenCL calculations.
I want to fit as much data into the GPU's memory as possible. Since the whole dataset is very large, I cut it into tiles,
which I upload one by one to the GPU and process one after another.
My tiles have a certain dimension, where every pixel represents a work-item.
My tiles also have a border of buffer pixels around them; those pixels don't count as work-items.
And I have some private arrays I need for the per-pixel/work-item calculations.

In order to determine the optimum tile size I can upload to the GPU, I check the maximum texture size the GPU supports and the amount of global memory it has.
But to really get my optimum tile size I need to know how many threads will execute in parallel, and thus how many private arrays I need to reserve memory for.

I assume it is something like max_running_threads * private_array_bytes,
but how do I determine the maximum number of work-items that will actually be processed in parallel?

Just using globalWorkSize * private_array_bytes works, but it obviously wastes most of the memory: only a small portion of the work-items run in parallel at any time, and only those need that memory for their arrays.
That means I waste memory I could be using to fit more actual texture data.
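To put hypothetical numbers on that waste (tile size, array size, and thread count below are all made up for illustration):

```c
#include <stddef.h>

/* All numbers are hypothetical, just to illustrate the gap. */
static void memory_waste_example(void)
{
    size_t tile_w = 2048, tile_h = 2048;   /* one work-item per pixel */
    size_t private_array_bytes = 256;      /* scratch per work-item   */

    /* Sized by globalWorkSize: 2048 * 2048 * 256 B = 1 GiB reserved. */
    size_t by_global_size = tile_w * tile_h * private_array_bytes;

    /* Sized by threads actually in flight (say 20480): only 5 MiB. */
    size_t threads_in_flight = 20480;
    size_t by_resident_threads = threads_in_flight * private_array_bytes;
}
```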

I thought the local work size and the number of cores might be related. But since I can set the local work size myself, it can't be as simple as "the number of items per local group equals the number of threads running in parallel".

Someone help me out here and give me some pointers …

My own answer:
When trying to stream-compute massive data that fits into neither RAM nor GPU memory, one has to split the data in a way that (see the query sketch after this list):

Each chunk fits into RAM.
Each single buffer-object used has to fit into CL_DEVICE_MAX_MEM_ALLOC_SIZE.
The sum of the sizes of all buffer objects + ((the private bytes your kernel uses) * localWorkSize) has to fit into CL_DEVICE_GLOBAL_MEM_SIZE.
The localWorkSize can have at most CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS dimensions, each no larger than the corresponding entry of CL_DEVICE_MAX_WORK_ITEM_SIZES.
The total localWorkSize (the product of its dimensions) must not exceed min(CL_DEVICE_MAX_WORK_GROUP_SIZE, CL_KERNEL_WORK_GROUP_SIZE).
The total localWorkSize should be a multiple of the device's warp/wavefront size (64 for me).
The globalWorkSize has to be a multiple of the localWorkSize; in my solution every global dimension is a multiple of the corresponding local dimension, and the factors may differ per dimension.
The globalWorkSize has to be smaller than pow(2, CL_DEVICE_ADDRESS_BITS), both as a whole and in each dimension.
And last but not least, if you use image buffers, make sure they obey CL_DEVICE_IMAGE2D_MAX_WIDTH x CL_DEVICE_IMAGE2D_MAX_HEIGHT.
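For reference, a minimal sketch of how those limits can be queried, assuming you already have a valid cl_device_id and a built cl_kernel (error checking omitted):

```c
#include <stdlib.h>
#include <CL/cl.h>

static void query_limits(cl_device_id device, cl_kernel kernel)
{
    cl_ulong max_alloc, global_mem;
    size_t   max_wg, img_w, img_h, kernel_wg;
    cl_uint  max_dims, addr_bits;

    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,       sizeof(max_alloc),  &max_alloc,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,          sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,      sizeof(max_wg),     &max_wg,     NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(max_dims),   &max_dims,   NULL);
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS,             sizeof(addr_bits),  &addr_bits,  NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH,        sizeof(img_w),      &img_w,      NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT,       sizeof(img_h),      &img_h,      NULL);

    /* Per-dimension work-item limits. */
    size_t *item_sizes = malloc(max_dims * sizeof(size_t));
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    max_dims * sizeof(size_t), item_sizes, NULL);

    /* Kernel-dependent limit; re-query whenever the kernel changes. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg), &kernel_wg, NULL);

    /* The cap for the total local work size. */
    size_t max_local_total = (max_wg < kernel_wg) ? max_wg : kernel_wg;

    free(item_sizes);
}
```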

That means the size and number of data chunks you can push to your GPU depend first on RAM, then on the GPU type, then on GPU memory, and then on the kernel you use.

You will notice that your globalWorkSize ends up bigger than the number of items you actually wanted to process, even if you try to match the number of items you really have. Just skip the superfluous items in your kernels (see the sketch below); believe me, it's still way faster than using some global/local work size that fits your data but not your device.
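A sketch of what I mean: round the global size up on the host, and guard against the padding in the kernel. The helper name and the real_w/real_h kernel arguments are hypothetical:

```c
/* Host side: round the global size up to a multiple of the local size. */
size_t round_up(size_t x, size_t multiple) {
    return ((x + multiple - 1) / multiple) * multiple;
}

/* Kernel side (OpenCL C): real_w/real_h are hypothetical arguments
   holding the true tile dimensions. */
__kernel void process(__read_only image2d_t tile, int real_w, int real_h)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= real_w || y >= real_h)
        return;   /* superfluous padding item: do nothing */
    /* ... per-pixel work ... */
}
```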

My approach was basically to calculate all the needed dimensions and sizes at the Host > App > Context > Device > Kernel levels, recalculating per level where needed, i.e. recalculating kernel-dependent values when the kernel changes, and so on.

OpenCL 1.1 helps out here by providing a way to query the amount of private memory a kernel uses, without adding it up manually.
Right now I actually have to read through the kernel sources and sum up all the variables I use to get accurate values.
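That query is clGetKernelWorkGroupInfo with CL_KERNEL_PRIVATE_MEM_SIZE; a minimal sketch, assuming a built kernel and its device:

```c
/* OpenCL 1.1: private memory per work-item, as reported by the compiler. */
cl_ulong private_bytes = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);
```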

However, that way I was finally able to use something like 99.9% of my GPU's memory without seemingly random CL_OUT_OF_YO_MAMMA. (Make sure you use the notify callback you can pass at context creation to really get all of those errors in time.)
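For completeness, a minimal sketch of such a notify callback; the callback name and what you do with the message are of course up to you:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Invoked asynchronously by the implementation on context errors. */
static void CL_CALLBACK context_notify(const char *errinfo,
                                       const void *private_info,
                                       size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL context error: %s\n", errinfo);
}

/* ... inside your setup code: */
cl_int err;
cl_context ctx = clCreateContext(NULL, 1, &device,
                                 context_notify, NULL, &err);
```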

Please feel free to correct/expand.

How can you do this? When OpenCL compiles the kernel, it optimizes variables away. So you would have to look at the binary and, I guess, keep track of the number of registers used? I'm not exactly sure how to do that.

When I try -cl-opt-disable as a flag for clBuildProgram I get CL_INVALID_BINARY.

You are right - it's by no means a good way. I'd prefer to use CL_KERNEL_PRIVATE_MEM_SIZE to get the real amount, but it doesn't work for me (it never returns usable values). However, the compiler optimizing variables away doesn't really matter: with the maximum value I counted I'm still on the safe side, even if the real value is lower.