How to copy global memory to local memory

Hello,

I wrote an algorithm and I want to benefit from the local memory concept. I have two arrays, a[row][col] and C[col], and I do maths on them. Below is the relevant part showing how I create the buffers and choose the work-group size. My question is how to use local memory: I am a bit confused about how to choose the size of the local memory, and if the local memory is smaller than the global memory, how do I manage that?

thank you in advance

Buffer bufferA = Buffer(context, CL_MEM_READ_ONLY, sizeof(cl_float) * col * row);
Buffer bufferC = Buffer(context, CL_MEM_READ_ONLY, sizeof(cl_float) * col);
Buffer bufferO = Buffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float) * row);

kernel.setArg(0, bufferA);
kernel.setArg(1, bufferC);
kernel.setArg(2, bufferO);
kernel.setArg(3, row);
kernel.setArg(4, col);

NDRange globalNDRange(row);  // Total number of work items
NDRange localNDRange(1);     // Work items in each work-group

queue.enqueueNDRangeKernel(kernel, NDRange(), globalNDRange, localNDRange, NULL, &event);

__kernel void rmsCalculation(const __global float* a,
                             const __global float* C,
                             __global float* O,
                             const int row,
                             const int col)
{
    const int ar = get_global_id(0);
    float R = 0;
    float I = 0;
    float c = 0;
    float sum = 0;

    for (int j = 0; j < col; ++j)
    {
        c = C[j] * a[ar * col + j]; // C * number of repetitions
        I = 0;
        do
        {
            R = I + c;
            I = 0;
            for (int k = 0; k < j; ++k)
            {
                I = I + C[k] * a[ar * col + k];
            }
        } while (I + c > R);

        sum = sum + R;
    } // end for (j = 0..)

    O[ar] = sum;
}

The primary use of local memory is when the work items of a work group use the same value from global memory multiple times. It is then more efficient to move the popular value from global memory to shared memory and let the individual work items read it from there instead. You appear to use only a single work item in each work group, so this does not apply to you directly. However, using single-item work groups is really bad on GPUs. It’s OK while testing algorithms for correctness, but larger work groups should be your primary concern before using shared memory. I’m assuming you are using a GPU since we are on the NVIDIA forums. Luckily, I don’t see any requirement on the work group size, so simply passing a larger work group size to ‘enqueueNDRangeKernel’, making sure the global range is valid and doing boundary checks in the kernel, should be enough.
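
As a rough sketch of that (the work-group size of 64 here is just an assumption, pick whatever works for your device):

// Host side: round the global size up to a multiple of the chosen work-group size.
size_t wgSize = 64;                                           // assumed value, tune per device
size_t globalSize = ((row + wgSize - 1) / wgSize) * wgSize;   // smallest multiple of wgSize >= row
NDRange globalNDRange(globalSize);
NDRange localNDRange(wgSize);
queue.enqueueNDRangeKernel(kernel, NDRange(), globalNDRange, localNDRange, NULL, &event);

and in the kernel, guard against the extra work items created by the rounding:

const int ar = get_global_id(0);
if (ar >= row) return;   // boundary check: the padding work items do nothing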

Regarding shared memory again, your ‘a’ matrix seems quite big and each work item appears to use its own part of this data. I deduced this from the following lines; let me know if I’m wrong.

const int ar = get_global_id(0);

for(int k=0 ; k<j ; ++k)

  I = ... a[ar * col + k];

In essence, I view this operation as a complicated matrix-vector product where each work item is responsible for one element in the output vector ‘O’, which is the result of operating on one row of ‘a’ and the entire vector ‘C’. So, with the first few lines of this post in mind, you could add an additional loop around your outer loop that breaks that outer loop into blocks. A work group reads a good-sized segment from ‘C’ into shared memory and then does the computations that require that part of ‘C’. Then a new block is fetched from ‘C’, overwriting the part just used, and another set of iterations of the inner loops is done. Repeat until the whole ‘C’ has been processed; a rough sketch of the loading pattern follows below.
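
Something along these lines (the names ‘TILE’ and ‘localC’ and the tile size of 64 are my own assumptions, and I have left the arithmetic itself as a comment since only the reads of ‘C’ change):

#define TILE 64   // assumed tile size, must not exceed the work-group size chosen on the host

__kernel void rmsCalculationTiled(const __global float* a,
                                  const __global float* C,
                                  __global float* O,
                                  const int row,
                                  const int col)
{
    const int ar  = get_global_id(0);
    const int lid = get_local_id(0);
    __local float localC[TILE];
    float sum = 0;

    for (int base = 0; base < col; base += TILE)       // blocking loop over 'C'
    {
        // The work group cooperatively copies one segment of 'C' into local memory.
        if (lid < TILE && base + lid < col)
            localC[lid] = C[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (ar < row)
        {
            // ... run the iterations of your j loop that fall in [base, base + TILE),
            //     reading localC[j - base] instead of C[j], and accumulate into 'sum' ...
        }
        barrier(CLK_LOCAL_MEM_FENCE);                  // wait before the segment is overwritten
    }

    if (ar < row)
        O[ar] = sum;
}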

It might be better to leave your outer loop intact and do the above on the inner for-loop instead. You will have to either think about what your particular input data will do to the loop iterations or just experiment. I recommend doing both.

As a side note, your memory strides when reading from ‘a’ in the code I cited above look bad. You should look into other matrix storage formats.

Thank you for your nice reply. I am a beginner with OpenCL, and it’s easy to recognize that from my questions.

Thank you for your note about the number of work items in each work group; I tried to increase it and the performance improved :). But should the size of each work group be related to the total number of work items? And is there a best size for the number of work items in each work group?

Yes, exactly as you said: each work item is responsible for one row of ‘a’ and does this calculation with the vector ‘C’, and after that the work item stores a result in the output vector ‘O’. I made each work item responsible for one row of ‘a’ since the elements of each row are related in the computation.

The vector ‘C’ is small compared with ‘a’, so I should vary their sizes for testing, but the maximum size of ‘C’ will be 40. Which array should I move to shared memory, ‘C’ or ‘a’, or both? And should the size of the block be related to the size of the work group?

I saw this in a matrix multiplication example, which is why I used it. Can you please explain what you mean by “You should look into other matrix storage formats”?

This is a very hard question. It is often not related to the global work size. Instead, the size of the work groups should be determined by the limited resources available on the particular device that is going to execute the kernel. This includes register usage, shared memory usage, optimal occupancy, hardware implementation limitations and such, and that is probably not a complete list. One should choose the work group size that gives the best performance, but finding that size is not easy, except by trying a few and seeing what happens. The OpenCL framework provides “clGetKernelWorkGroupInfo”, which can tell you what the maximum work group size for a kernel is. While not always optimal, this is at least a decent first guess. Also, one can pass in NULL instead of an explicit value; this lets the OpenCL implementation choose the work group size for you. I haven’t tried this much, but rumor has it that it can be very bad at choosing sizes. Try it and see what happens.
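
With the C++ wrapper you are already using, that query looks roughly like this (‘device’ here is assumed to be the cl::Device the kernel will run on, and NullRange is the wrapper’s equivalent of passing NULL):

// Ask the implementation for the largest work-group size this kernel can use on 'device'.
size_t maxWg = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
NDRange localNDRange(maxWg);   // a decent first guess, then experiment from there

// Or let the implementation pick the local size itself:
queue.enqueueNDRangeKernel(kernel, NDRange(), globalNDRange, NullRange, NULL, &event);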

It all depends on the sizes of each of the data sets and the size of the available shared memory on the device. If you can fit everything into shared memory then that might be the way to go. This is however rare, since shared memory is limited, and if the data set is so small that all of it fits, then I’m guessing that the problem is too small to be solved efficiently on a GPU. GPUs require massive parallelism in order to work efficiently. You say that the size of ‘C’ is 40, which is a very small number and this is a bit worrisome. What is the size of ‘a’? Also consider that using more shared memory in each work group reduces the number of work groups that the GPU can have on-chip, which in turn reduces the number of active work items, which in turn reduces the hardware’s ability to do its latency hiding. All of this becomes a complicated consideration with multiple parts that play a role. It takes quite a bit of experience to be able to see or feel what will be a good trade-off between all these parameters. If someone knows of a good way to estimate this then please let us know. In your case I’m guessing that ‘C’ is better to have in shared memory since that is the shared data.
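
To put numbers on it, you can query the device’s local memory and compare it against your data sets. A small host-side sketch (assuming ‘device’ is your cl::Device and <iostream> is included):

cl_ulong localMemBytes = device.getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();  // typically 16-48 KB on GPUs
size_t bytesC = sizeof(cl_float) * col;        // e.g. 40 floats = 160 bytes, fits easily
size_t bytesA = sizeof(cl_float) * col * row;  // usually far too large for local memory
std::cout << "local mem: " << localMemBytes << " B, C: " << bytesC << " B, a: " << bytesA << " B\n";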

What you should do is make a small example on paper. Draw a long line that represents the global memory that holds a small example of ‘a’. Then, in your head, execute the kernel for some small number of work items, say a work group of four or so. These four work items execute in parallel, so all memory operations happen at the same time. For the first memory operation, mark on the paper the memory location that each work item accesses with a color, the same color for all work items. Then continue the execution until the next memory operation and mark all locations on the paper with a new color. What you want is for each color region to be contiguous, since that is the way the hardware can do the fastest memory transfers. I expect you’ll find that each color is strided across the paper, while the memory operations PER WORK ITEM are contiguous instead.

There are plenty of matrix storage formats for GPUs described in the literature, and a quick search should turn up a few, but in this case it seems you can make do with a dense rectangular storage format, which is what you have right now. What you can do is change it from a row-major format to a column-major one, where the elements walk down the columns instead of along the rows as the memory address increases. Figure out how to do the addressing calculation, currently ar * col + k for the row-major format, for a column-major format, and then do the paper exercise again. If all is correct, then you should get nice contiguous colored blocks and hopefully better performance when you run on the GPU.
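
The change to the addressing itself is small; a sketch of the inner loop with a column-major layout (assuming the host fills ‘a’ in the matching column-major order):

// Row major (current):  element (ar, k) is at a[ar * col + k]
//   consecutive k are contiguous per work item, but neighbouring work items are 'col' floats apart.
// Column major:         element (ar, k) is at a[k * row + ar]
//   for a fixed k, neighbouring work items read adjacent addresses, which coalesces nicely.
for (int k = 0; k < j; ++k)
    I = I + C[k] * a[k * row + ar];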

Thank you for this explanation, it is much clearer now, and I will try to work through these suggestions.