__local (Shared) memory increases kernel execution time?

I have some code I am trying to optimize. Based on various experiments I have done, it seems that I am currently bandwidth limited. I wanted to change a variable that I write to from the __global memory space to the __local memory space, which I assumed would give me significantly more memory bandwidth. Here is the snippet of my OpenCL code where I write to global memory. Most of this is proprietary, so this is a somewhat generic example.

HOST CODE

cl::NDRange globalWorkItems(64, 256);
cl::NDRange localWorkItems(1, 128);

status = m_cmd_queue.enqueueNDRangeKernel(m_kernel, cl::NullRange, globalWorkItems, localWorkItems, 0, &kernel_event[0]);

KERNEL CODE

__kernel void MyKernel(__global float *single_ray, __global float *density, __global float *original, int ystop, int y_decimation)
{
  int num_x = get_global_size(1);
  int num_y = 64;
  int det_ind = get_global_id(0);
  int x_ind = get_global_id(1);
  float coh = 1.734f;

  for (int i = 0; i < ystop; i++)
  {
    int y_index = i / y_decimation;
    int ray_offset = det_ind*num_x*num_y + y_index*num_x + x_ind;
    // some_index stands in for the proprietary indexing omitted from this generic example
    single_ray[ray_offset] += density[some_index]*original[some_index]*coh;
  }
}

When I convert the variable single_ray to __local (shared) memory, the execution takes nearly 3 times as long! I'm not sure why this would happen, as I would expect it to be much faster. Can anyone offer any insight? I can provide more details if this explanation isn't clear enough; unfortunately, I will not be able to post the entire code.

How large is your single_ray array? If it is too large, you may not be able to have enough work-groups running on a core simultaneously to achieve full utilization. As far as I know, the older (GT8800, GT2xx) cards have only 16K of __local memory per core, and you'll want about 8 work-groups of 256 threads to achieve full utilization, so any __local buffer of more than ~1.5K or so (some of the local memory is also reserved for kernel arguments and other per-block data) will reduce utilization and thus performance.

See NVidia's OpenCL Best Practices Guide for more details.
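
For reference, the relevant limits can be queried at run time with the same C++ bindings used in the host code above. This is a minimal sketch (error handling omitted; it assumes the cl::Device object is available as device alongside the existing m_kernel):

#include <iostream>

// How much __local memory the device has, and how much this kernel's
// work-group configuration will actually consume.
cl_ulong local_mem_size = device.getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
size_t   max_wg_size    = m_kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
cl_ulong kernel_local   = m_kernel.getWorkGroupInfo<CL_KERNEL_LOCAL_MEM_SIZE>(device);

std::cout << "Device local memory:       " << local_mem_size << " bytes\n"
          << "Max work-group size:       " << max_wg_size    << "\n"
          << "Local memory this kernel:  " << kernel_local   << " bytes\n";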

I’m using the Tesla C2050 card, and it reports there are 49152 bytes of shared memory available. I would assume that is per core. I have tried different combinations of the local work size which directly affects the size of the single_ray array. The size is always 64*get_local_size(1)*sizeof(float). The local size can be varied between 16 and 128 (making the array between 4096 and 32768 bytes), but in each case I still get worse performance than when single_ray is a global buffer.

I reckon you are a bit confused about local memory. Local memory is allocated for a whole block of threads (a work-group), so indexing it with a global ID doesn't make sense. Otherwise you would have to allocate, for every block of threads, an array sized for the total number of threads… which I suppose isn't your case:

int det_ind = get_global_id(0); // REQUESTS A GLOBAL ID!
int ray_offset = det_ind*num_x*num_y + y_index*num_x + x_ind;
single_ray[ray_offset] ....

I also wonder why there is an extra factor of 64; I would expect it to be only get_local_size(1)*sizeof(float)…
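
To illustrate per-work-group indexing with a minimal, self-contained sketch (this is not the original kernel; the array name scratch and the fixed local size of 128 are assumptions):

__kernel void local_indexing_sketch(__global float *out)
{
  // One copy of this array exists per work-group; sized for get_local_size(1) == 128
  __local float scratch[128];

  int lid = get_local_id(1);      // index within this work-group only
  scratch[lid] = (float)lid;      // each work-item writes its own slot

  barrier(CLK_LOCAL_MEM_FENCE);   // required before reading slots written by other work-items

  out[get_global_id(0) * get_global_size(1) + get_global_id(1)] = scratch[lid];
}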

Sorry, when I use local memory I do utilize the local_id, something like this:

int local_id = get_local_id(1);
int ray_offset = y_index*num_x + local_id;
single_ray[ray_offset] ....

There’s another for loop within my code, which is why I need the 64*get_local_size(1)*sizeof(float). Note that ystop is typically 256 and y_decimation is typically 4, hence the 64:

for (int y = 0; y < ystop; y++)
{
  int y_index = y / y_decimation; // y_decimation is a constant
  int ray_offset = y_index*num_x + local_id;
  single_ray[ray_offset] ....
}

It may seem more logical to make my workgroups three-dimensional, instead of (64, 256) it would be (64, 64, 256) to get rid of this “ystop” for loop. It’s not possible though if single_ray is __local, because all of the work items need to have access to all 64 “y_indexes” later in the code, if that makes sense. If my local work size were something like (1, 1, 256) in this case, each kernel instance would only have valid data in 1 of the y_index positions.
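
Putting the description together, the __local variant presumably looks roughly like the sketch below. This is a guess, not the actual code: the macro LOCAL_X, the zero-initialization loop, and the placeholder some_index are all assumptions, since the real indexing is proprietary.

#define LOCAL_X 128   // get_local_size(1); the OP varies this between 16 and 128

__kernel void MyKernelLocal(__global float *density, __global float *original,
                            __global float *out, int ystop, int y_decimation)
{
  // 64 * LOCAL_X floats per work-group: 4 KB at LOCAL_X = 16, 32 KB at LOCAL_X = 128
  __local float single_ray[64 * LOCAL_X];

  int local_id = get_local_id(1);
  float coh = 1.734f;

  // Clear this work-item's 64 slots before accumulating
  for (int y_index = 0; y_index < 64; y_index++)
    single_ray[y_index * LOCAL_X + local_id] = 0.0f;

  for (int y = 0; y < ystop; y++)
  {
    int y_index = y / y_decimation;
    int some_index = y;   // hypothetical placeholder for the proprietary indexing
    single_ray[y_index * LOCAL_X + local_id] += density[some_index] * original[some_index] * coh;
  }

  barrier(CLK_LOCAL_MEM_FENCE);  // before work-items read each other's y_index rows

  // ... later code that reads all 64 y_index rows would go here ...
  out[get_global_id(0) * get_global_size(1) + get_global_id(1)] = single_ray[local_id];
}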

If you're using a minimum of 4K of __local memory per work-group, I still think (under)utilization may be your problem. I suggest using NVidia's profiler to check whether anything odd happens when you use __local instead of __global memory (like 'occupancy' dropping below 1). Two other things that might cause your problem:

  1. Are you copying data from/to the __local buffer? If so, perhaps the copying is done using a suboptimal memory access pattern.

  2. The NVidia best practices guide goes to great lengths about accessing (__local) memory in a way that minimizes bank conflicts (as far as I understood it, you want each active thread to access a different memory bank so that the accesses do not have to be serialized); see the short sketch below.

Though to be honest, I can’t imagine either 1) or 2) slowing the code down enough to make __global memory access faster than __local memory access.
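
As a concrete, hypothetical illustration of the bank-conflict point (the names and sizes here are mine, not from the thread; it assumes a 1-D work-group of 32 work-items):

__kernel void bank_conflict_sketch(__global float *out)
{
  // A stride of 32 floats between consecutive work-items maps them all to the
  // same bank, so those accesses serialize.
  __local float tile[32][32];
  // Padding each row by one element changes the stride to 33 floats, which
  // spreads the same access pattern across different banks.
  __local float padded[32][33];

  int lid = get_local_id(0);

  for (int r = 0; r < 32; r++) {
    tile[r][lid]   = (float)(r + lid);   // consecutive work-items hit consecutive addresses: no conflict
    padded[r][lid] = (float)(r + lid);
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  float a = 0.0f, b = 0.0f;
  for (int c = 0; c < 32; c++) {
    a += tile[lid][c];     // stride-32 pattern across work-items: bank conflicts
    b += padded[lid][c];   // stride-33 pattern: conflict-free
  }

  out[get_global_id(0)] = a + b;
}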

Thanks, now it is a bit clearer. I think eric might be right about the underutilization. You still have global memory accesses, and if the occupancy is low, their latency won't be hidden by switching to another block of threads.

I think chopping all of the arrays (single_ray, density, original) into parts, loading them into local memory with coalesced reads, and then doing the work would be a better optimization than keeping the whole single_ray in local memory… though it depends on whether that is possible and how you need to index…

single_ray[ray_offset] += density[some_index]*original[some_index]*coh;
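
A rough sketch of that tiling idea (hypothetical: the kernel name, the TILE size, and the assumption that density and original can be read in contiguous tiles are all mine, since the real indexing is proprietary):

#define TILE 128  // one element per work-item for a (1, 128) local work size

__kernel void tiled_sketch(__global const float *density,
                           __global const float *original,
                           __global float *out,
                           int num_tiles)
{
  __local float d_tile[TILE];
  __local float o_tile[TILE];

  int lid = get_local_id(1);
  float acc = 0.0f;

  for (int t = 0; t < num_tiles; t++)
  {
    int base = t * TILE;                 // hypothetical contiguous tile layout
    d_tile[lid] = density[base + lid];   // coalesced: consecutive work-items read consecutive addresses
    o_tile[lid] = original[base + lid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... work on the tile from fast __local memory ...
    acc += d_tile[lid] * o_tile[lid] * 1.734f;

    barrier(CLK_LOCAL_MEM_FENCE);        // before the tile is overwritten in the next iteration
  }

  out[get_global_id(0) * get_global_size(1) + get_global_id(1)] = acc;
}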

Eric, thanks for your help.

1.) I am not copying any data from/to this local buffer; I am simply writing to it the results of a calculation that I will use in later calculations (hence no need to store them in a global buffer, since they are temporary). This write, as you can see from the index (ray_offset), should be coalesced because it corresponds to the local index.

2.) I will read through this guide for a refresher, because you’re probably right that I am missing something in terms of how I’m utilizing this memory.

I have tried to use the profiler, but haven’t had any luck with this tool. Running on SLES 11 SP1, I always get an error when I launch my application. Specifically it says:

"Application: “myexecutable”.

Profiler data file

‘my_home_directory/temp_compute_profiler_0_0.csv’

for application run 0 not found."

I’m not sure what that means, but it’s probably a topic for another thread.

This is a good point about the mix of global and local memory accesses (though, like Eric, I'm not sure why that alone would make it slower). I will look through my code and see if I can do what you suggested. The difficult part might be having enough local memory to do it, though.