about local device memory allocation

Hello everybody! I just started learning OpenCL and have been reading some sample code. In a matrix multiplication example illustrating the use of local device memory, the code is as follows:

// Optimized matrix multiplication using work-groups and local memory.
// BLOCKSIZE is assumed to be defined elsewhere (e.g. as a build option)
// and to equal the work-group edge length in both dimensions.
__kernel void multMatrix(__global float *mO,
                         __global float *mA,
                         __global float *mB,
                         uint widthA, uint widthB)
{
    uint lx = get_local_id(0);
    uint ly = get_local_id(1);
    int gx = get_group_id(0);
    int gy = get_group_id(1);

    // calculate the starting index in the global array for each sub-matrix
    uint iSubA = BLOCKSIZE * gy * widthA;
    uint iSubB = BLOCKSIZE * gx;

    // get the number of groups in dimension 0, used as the number of blocks
    // to iterate over (equals widthA / BLOCKSIZE only when widthA == widthB)
    int n = get_num_groups(0);

    // variable to hold the running total
    float sum = 0;

    // for each block
    for (int i = 0; i < n; i++)
    {
        // declare local memory for each sub-matrix
        __local float tA[BLOCKSIZE][BLOCKSIZE];
        __local float tB[BLOCKSIZE][BLOCKSIZE];

        // copy a portion of the input matrices into the sub-matrices
        tA[ly][lx] = mA[ly*widthA + lx + (iSubA + i*BLOCKSIZE)];
        tB[ly][lx] = mB[ly*widthB + lx + (iSubB + i*BLOCKSIZE*widthB)];

        // wait for all work-items in the group to finish copying
        barrier(CLK_LOCAL_MEM_FENCE);

        // multiply the two sub-matrices together
        for (int k = 0; k < BLOCKSIZE; k++)
        {
            sum += tA[ly][k] * tB[k][lx];
        }

        // wait for all work-items to finish reading before the next copy
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // copy the final result to the output buffer
    int globalIdx = get_global_id(0);
    int globalIdy = get_global_id(1);
    mO[globalIdy * widthA + globalIdx] = sum;
}

I got confused by the two lines that declare tA and tB as __local inside the loop (highlighted in red in the original post). Since every thread runs the kernel, does that mean every thread in a work-group allocates the local memory separately?

I ran the code above, and the result is correct.

I thought it should look something like this:



__local float tA[BLOCKSIZE][BLOCKSIZE];

__local float tB[BLOCKSIZE][BLOCKSIZE];


so that the memory is allocated by only one thread in the work-group.

Could anyone help me figure this out? Thanks.

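According to the OpenCL reference, the __local address space qualifier is used for variables that are allocated in local memory and shared by all work-items of a work-group.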

So, as the variables are shared across all threads of a work-group, it does not matter which thread "allocates" them. In fact, don't think of the allocation as happening in any thread at all. Instead, the allocation most likely happens before any thread exists, and each thread simply gets a pointer to the shared memory.
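To make that mental model concrete, here is a rough host-side sketch in plain C (not OpenCL) that emulates the tiled multiplication one work-group at a time. It is only an illustration under assumptions: the names emulate_group and tiled_matmul are made up, BLOCKSIZE is set to 2, the matrices are square with an edge length that is a multiple of BLOCKSIZE, and the work-items run one after another in a loop. The point is that tA and tB are created once per group, before any work-item runs, and every work-item then uses the same storage.

```c
#include <stddef.h>

#define BLOCKSIZE 2  /* hypothetical tile edge, kept small for illustration */

/* Emulates one work-group on the host. n is the edge length of the square
   matrices and is assumed to be a multiple of BLOCKSIZE. */
static void emulate_group(float *out, const float *a, const float *b,
                          size_t n, size_t gx, size_t gy)
{
    /* Created once per group, before any "work-item" runs: the stand-ins
       for __local tA and tB. */
    float tA[BLOCKSIZE][BLOCKSIZE];
    float tB[BLOCKSIZE][BLOCKSIZE];
    /* One running total per work-item: the stand-in for the private sum. */
    float sum[BLOCKSIZE][BLOCKSIZE] = {{0.0f}};

    for (size_t t = 0; t < n / BLOCKSIZE; t++) {
        /* Each work-item (ly, lx) copies one element into the shared tiles. */
        for (size_t ly = 0; ly < BLOCKSIZE; ly++)
            for (size_t lx = 0; lx < BLOCKSIZE; lx++) {
                tA[ly][lx] = a[(gy * BLOCKSIZE + ly) * n + t * BLOCKSIZE + lx];
                tB[ly][lx] = b[(t * BLOCKSIZE + ly) * n + gx * BLOCKSIZE + lx];
            }
        /* Finishing the copy loop plays the role of barrier(CLK_LOCAL_MEM_FENCE). */
        for (size_t ly = 0; ly < BLOCKSIZE; ly++)
            for (size_t lx = 0; lx < BLOCKSIZE; lx++)
                for (size_t k = 0; k < BLOCKSIZE; k++)
                    sum[ly][lx] += tA[ly][k] * tB[k][lx];
    }

    /* Each work-item writes its own result to the output matrix. */
    for (size_t ly = 0; ly < BLOCKSIZE; ly++)
        for (size_t lx = 0; lx < BLOCKSIZE; lx++)
            out[(gy * BLOCKSIZE + ly) * n + gx * BLOCKSIZE + lx] = sum[ly][lx];
}

/* Runs every work-group in turn. */
void tiled_matmul(float *out, const float *a, const float *b, size_t n)
{
    for (size_t gy = 0; gy < n / BLOCKSIZE; gy++)
        for (size_t gx = 0; gx < n / BLOCKSIZE; gx++)
            emulate_group(out, a, b, n, gx, gy);
}
```

Calling tiled_matmul(out, a, b, 4) on two 4x4 matrices should give the same result as a naive triple loop. A real device runs the work-items of a group concurrently, which is why the kernel needs explicit barriers where this sketch simply finishes one loop before starting the next.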

Thanks, eyebex! So can I just assume that the OpenCL runtime does all the allocation for me? I've got another question: what if I want to allocate local memory in every thread? Should I use a different name for the memory in each thread? I know this rarely happens, I just want to be clear about it.

Thanks again!

I’m slightly confused by the wording of your question. You cannot allocate __local memory that just belongs to one thread. __local memory with the same variable name will be available under that name in every thread. If you introduce a different __local variable with a different name, it will also be available in all threads.

I believe you’re confusing __local variables with “function local variables” in the sense of __private variables in OpenCL, which are accessible from a single thread only. In OpenCL, variables declared without any address qualifier are implicitly __private:

__kernel void main() {
    __local float A; // Shared across all threads / instances of main().
    float B;         // Private to main. Each thread / instance of main() gets its own instance of B.
}

Maybe it makes more sense for you to think of __local = shared and __private = local.
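If it helps, the difference can be mimicked in plain C on the host. This is only a sketch, not OpenCL: emulate_local_vs_private and GROUP_SIZE are hypothetical names, and the work-items run one after another in a loop, which sidesteps the synchronization a real kernel would need when several work-items update the same __local variable.

```c
#define GROUP_SIZE 4  /* hypothetical number of work-items in the group */

/* Emulates one work-group in which every work-item adds (id + 1) to both
   A and B. shared_out receives the final A; private_out[id] receives the
   B seen by work-item id. */
void emulate_local_vs_private(int *shared_out, int private_out[GROUP_SIZE])
{
    int A = 0;                      /* one instance per group, like __local */
    for (int id = 0; id < GROUP_SIZE; id++) {
        int B = 0;                  /* fresh instance per work-item, like __private */
        A += id + 1;                /* every work-item updates the same A */
        B += id + 1;                /* each work-item updates only its own B */
        private_out[id] = B;
    }
    *shared_out = A;
}
```

After the call, the shared value is 10 (1 + 2 + 3 + 4), because all four work-items accumulated into the same A, while the private values are 1, 2, 3 and 4, because each work-item only ever saw its own B.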

Thanks a lot, I got it.