__local memory

Hi all,
What’s the difference between declaring a __local variable inside the kernel and passing it as an argument?

In the NVIDIA examples (more precisely in oclMatrixMul) the __local buffers are allocated in the kernel args, and an ‘empty’ arg is set on the kernel.
I’ve done a quick test and, instead of passing it as an argument, I’ve declared the same buffer inside the kernel, and the result was the same.

So what’s the difference, and why would anyone use one way or the other?

Thanks in advance

I think the difference is that when you pass a local memory buffer as an argument, you can set its size dynamically at run time. If you declare the buffer inside the kernel, you have to recompile the kernel whenever you want to change the local memory size.
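To make the two options concrete, here is a minimal sketch. The kernel names (reverse_static, reverse_dynamic), the tile size of 256 and the helper local_bytes_for are made up for illustration and are not taken from the NVIDIA samples. The kernel source is held in C strings so that the host-side file compiles on its own.

```c
#include <stddef.h>

/* Variant A (OpenCL C source): the buffer size is a compile-time constant
 * inside the kernel. Changing 256 means rebuilding the program. The kernel
 * assumes the work-group size is at most 256. */
static const char *kernel_static_src =
    "__kernel void reverse_static(__global float *data) {\n"
    "    __local float tile[256];\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t gid = get_global_id(0);\n"
    "    tile[lid] = data[gid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    data[gid] = tile[get_local_size(0) - 1 - lid];\n"
    "}\n";

/* Variant B (OpenCL C source): the buffer is a kernel argument, so the host
 * chooses its size for each launch without rebuilding. */
static const char *kernel_dynamic_src =
    "__kernel void reverse_dynamic(__global float *data,\n"
    "                              __local float *tile) {\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t gid = get_global_id(0);\n"
    "    tile[lid] = data[gid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    data[gid] = tile[get_local_size(0) - 1 - lid];\n"
    "}\n";

/* Bytes to request for the __local argument of variant B: one float per
 * work-item. On the host this value would be passed as
 *     clSetKernelArg(kernel, 1, local_bytes_for(wg_size), NULL);
 * The NULL pointer plus a non-zero size is the 'empty' argument: the
 * runtime allocates the local memory, the host supplies no data. */
size_t local_bytes_for(size_t work_group_size)
{
    return work_group_size * sizeof(float);
}
```

Both kernels do the same work (each work-group reverses its slice of data); only the place where the size is decided differs.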

I think so too. Besides that, if you “allocate” it inside your kernel code, you know up front how much __local memory your kernel requires. If you pass it as an argument, you don’t, and you can run into a CL_OUT_OF_RESOURCES error while running the kernel (that can happen in the other case as well anyway :-P). “Allocating” the __local memory in the kernel imposes the same limitation as declaring arrays on the stack in standard C/C++: the size cannot be “dynamic”; only compile-time constants are accepted as sizes.

The __local buffer is also passed as an arg in the oclHistogram example. In that context it is shared by warps. So if you want warps to access the same resource, you pass it as an arg.

Not correct. Whether it is passed as an argument or defined in the kernel, all work items in the work group can access the __local memory.

All work items are broken down into warps and processed. So if warps can access the multiprocessor’s local memory, then all the work items certainly can, so my earlier statement stands.

My point was that if you declare memory as __local inside the kernel - and not as an argument - then all work items in the group can access that memory.

Incidentally, if you are spending a lot of time working through these examples, perhaps we can collaborate? I could certainly do with bouncing questions off someone.

The dynamic / static tradeoffs discussed above are good. When going dynamic, you defer the exact size until run time and pass it as an arg. One reason to go static is when calculating the size is a pain, as with structures, arrays of structures and, to a lesser extent, multi-dimensional arrays of primitives. Clogging up your host code with bookkeeping that has to take alignment into account adds NO value, and only makes it more difficult to change the structure months later.
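As a rough illustration of that bookkeeping, here is a sketch of what the host has to compute for a dynamic 2D tile of structures. The Particle type and the helper particle_tile_bytes are hypothetical, and the sketch assumes the host compiler and the OpenCL compiler lay the structure out identically, which is not guaranteed in general.

```c
#include <stddef.h>

/* Hypothetical element type; the kernel source would need a matching
 * definition. Only 4-byte scalar members are used so that host and device
 * layouts are likely to agree. Vector types such as float4 have stricter
 * alignment on the device and make this harder. */
typedef struct {
    float pos[3];
    float mass;
    int   id;
} Particle;

/* Dynamic case: the host computes the byte count for a rows x cols tile and
 * passes it as the size of the __local kernel argument. Any change to
 * Particle must be repeated in the kernel source.
 *
 * Static case, for comparison (kernel side, no host arithmetic at all):
 *     __local Particle tile[8][8];
 */
size_t particle_tile_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(Particle);
}
```

With the static declaration the compiler does this arithmetic, padding included, which is the point made above.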

When mixing dynamic & static locals, you can query clGetKernelWorkGroupInfo() with CL_KERNEL_LOCAL_MEM_SIZE to see how much is already taken by static locals. It is important to call this before setting the kernel args, or the dynamic locals will also be included in the number.
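A minimal sketch of the arithmetic that follows from that query. The helper names and the ulong64 typedef are made up for illustration; the two OpenCL queries named in the comments are real, but they are only described here, not called, so the snippet compiles without the OpenCL headers.

```c
/* Stand-in for cl_ulong, which is a 64-bit unsigned integer. */
typedef unsigned long long ulong64;

/* static_bytes: the value reported for CL_KERNEL_LOCAL_MEM_SIZE by
 *     clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
 *                              sizeof(cl_ulong), &static_bytes, NULL);
 *   queried BEFORE any __local argument has been set.
 * device_bytes: the value reported for CL_DEVICE_LOCAL_MEM_SIZE by
 *   clGetDeviceInfo().
 * Returns how many bytes are left for dynamic __local arguments. */
ulong64 dynamic_local_budget(ulong64 device_bytes, ulong64 static_bytes)
{
    return static_bytes >= device_bytes ? 0 : device_bytes - static_bytes;
}

/* Returns 1 if a dynamic request of `requested` bytes fits next to the
 * static locals, 0 otherwise. */
int dynamic_request_fits(ulong64 device_bytes, ulong64 static_bytes,
                         ulong64 requested)
{
    return requested <= dynamic_local_budget(device_bytes, static_bytes);
}
```

For example, on a device with 16 KB of local memory and a kernel that statically uses 4 KB, dynamic_local_budget(16384, 4096) leaves 12288 bytes for the __local arguments.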

Currently it’s only the histogram one that I am interested in, because I have to get it implemented for my project. However, that has taken me on another journey; it is day 3 now. But I am understanding more and more… I will post what I have understood in the histogram post and we can go from there. I intend to crack it wide open.