I know there are many posts about coalesced memory access, but I have not found the information I am looking for.
When transferring data from global memory to local memory, I know the accesses are coalesced as long as the threads read sequentially (the k-th thread reads the k-th word in a global memory segment).
But as stated in the NVIDIA OpenCL Best Practices Guide (page 13):
On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must
access the k-th word in a segment aligned to 16 times the size of the elements
being accessed; however, not all threads need to participate.
So, if I interpret the last sentence correctly: if only, say, the first 4 threads read the first 4 words of a global memory segment, do we still have a coalesced access?
I’m asking because I’m working on a kernel where I need to load different blocks of global memory into local memory.
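To make the quoted rule concrete, here is a minimal host-side C sketch that checks one half warp against it. It models only the sentence quoted above (thread k reads word k of a segment aligned to 16 times the element size, idle threads allowed) and nothing else the hardware may require. The function name half_warp_is_coalesced and its parameters are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16

/* Hypothetical helper: models only the rule quoted from the guide for
 * compute capability 1.0/1.1. addr[k] is the byte address accessed by
 * thread k of a half warp; active[k] is false for threads that do not
 * participate in the access. */
bool half_warp_is_coalesced(const uintptr_t addr[HALF_WARP],
                            const bool active[HALF_WARP],
                            size_t elem_size)
{
    assert(elem_size > 0);
    const uintptr_t segment_size = HALF_WARP * elem_size;
    bool have_base = false;
    uintptr_t base = 0;

    for (int k = 0; k < HALF_WARP; ++k) {
        if (!active[k])
            continue;                 /* idle threads do not break the pattern */

        const uintptr_t offset = (uintptr_t)k * elem_size;
        if (addr[k] < offset)
            return false;             /* cannot be word k of any segment */

        /* Segment start implied by "thread k reads word k". */
        const uintptr_t implied = addr[k] - offset;
        if (implied % segment_size != 0)
            return false;             /* segment not aligned to 16 * elem_size */
        if (have_base && implied != base)
            return false;             /* threads point into different segments */

        base = implied;
        have_base = true;
    }
    return true;
}
```

Under this model, a half warp in which only threads 0 to 3 are active and read words 0 to 3 of an aligned segment passes the check, while the same four threads shifted by one word do not.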
...
// loading current block into local memory
int pos_in_loc = local_id_x + local_id_y * local_width * 2;
int pos_in_inp = global_id_x + global_id_y * input_width;
loc[pos_in_loc] = input_mat[pos_in_inp];

// loading next block on X-axis from global to local
if (local_id_x < 2 * padding)
{
    pos_in_loc = local_id_x + local_width + local_id_y * local_width * 2;
    pos_in_inp = global_id_x + local_width + global_id_y * input_width;
    loc[pos_in_loc] = input_mat[pos_in_inp];
}

// loading next block on Y-axis from global to local
if (local_id_y < 2 * padding)
{
    pos_in_loc = local_id_x + (local_id_y + local_height) * local_width * 2;
    pos_in_inp = global_id_x + (local_height + global_id_y) * input_width;
    loc[pos_in_loc] = input_mat[pos_in_inp];
}

// loading next block on X/Y-axis from global to local
if (local_id_x < 2 * padding && local_id_y < 2 * padding)
{
    pos_in_loc = local_id_x + local_width + (local_id_y + local_height) * local_width * 2;
    pos_in_inp = global_id_x + local_width + (local_height + global_id_y) * input_width;
    loc[pos_in_loc] = input_mat[pos_in_inp];
}

barrier(CLK_LOCAL_MEM_FENCE);
...
loc = local memory (size = work_group_width * work_group_height * 4 * sizeof(float))
input_mat = global memory (512x512)
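As a sanity check on the local indexing, the following host-side C sketch emulates the four guarded loads for a single work-group and counts how many distinct cells of loc get written. The 16x16 work-group matches the block size mentioned in the post; the constant names and the function count_written_cells are hypothetical, and padding is assumed to satisfy 2 * padding <= 16.

```c
#include <assert.h>
#include <string.h>

/* Assumed sizes for illustration: a 16x16 work-group, with loc laid out
 * as a (2 * LOCAL_W) x (2 * LOCAL_H) tile, as in the kernel's indexing. */
enum {
    LOCAL_W    = 16,
    LOCAL_H    = 16,
    LOC_STRIDE = 2 * LOCAL_W,
    LOC_CELLS  = 4 * LOCAL_W * LOCAL_H
};

/* Emulates the four loads of the kernel for every work-item of one
 * work-group and returns the number of distinct loc cells written.
 * written[] must provide LOC_CELLS bytes. */
int count_written_cells(int padding, unsigned char written[LOC_CELLS])
{
    assert(padding >= 0 && 2 * padding <= LOCAL_W && 2 * padding <= LOCAL_H);
    memset(written, 0, LOC_CELLS);

    for (int ly = 0; ly < LOCAL_H; ++ly) {
        for (int lx = 0; lx < LOCAL_W; ++lx) {
            /* current block */
            written[lx + ly * LOC_STRIDE] = 1;
            /* next block on the X-axis */
            if (lx < 2 * padding)
                written[lx + LOCAL_W + ly * LOC_STRIDE] = 1;
            /* next block on the Y-axis */
            if (ly < 2 * padding)
                written[lx + (ly + LOCAL_H) * LOC_STRIDE] = 1;
            /* next block on the X/Y-axis */
            if (lx < 2 * padding && ly < 2 * padding)
                written[lx + LOCAL_W + (ly + LOCAL_H) * LOC_STRIDE] = 1;
        }
    }

    int count = 0;
    for (int i = 0; i < LOC_CELLS; ++i)
        count += written[i];
    return count;
}
```

With these assumptions the guarded loads fill a (16 + 2 * padding) x (16 + 2 * padding) region of loc, and the whole 32x32 tile only when 2 * padding equals 16, which is the case that corresponds to removing the if statements.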
The kernel execution takes 3.46 ms on average. When I remove the if statements, loading is block-wise and therefore coalesced, and the average kernel execution time is about 3.21 ms.
Since the version with the if statements is only slightly slower, I assume that its reads are also coalesced and that the 0.25 ms difference is due to the if statements (block size 16).
Can block-wise reading cause any problems, given that there are no bounds checks against the borders of input_mat?
Right now the last blocks in x and y access areas of global memory that are not initialized. I tried the code (with the ifs commented out) on ATI and NVIDIA cards and it works fine.
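To put a number on the border issue, here is a host-side C sketch under two assumptions: the global work size equals the 512x512 matrix, and with the ifs removed every work-item also loads the elements one block to the right, one block below, and diagonally. count_reads_past_end counts how many of those unguarded reads use a linear index beyond the end of input_mat, and load_or_zero shows one possible guard. Both function names are hypothetical. Reading past the end of a buffer is undefined behaviour in OpenCL C, so code that appears to work on some cards may still fail elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guard: returns 0.0f for coordinates outside the matrix
 * instead of reading memory that does not belong to it. */
float load_or_zero(const float *input, int width, int height, int x, int y)
{
    assert(input != NULL && width > 0 && height > 0);
    if (x < 0 || x >= width || y < 0 || y >= height)
        return 0.0f;
    return input[x + y * width];
}

/* Counts unguarded halo reads whose linear index lies past the end of a
 * width x height buffer, assuming one work-item per matrix element and
 * three extra reads per work-item (right, below, diagonal block). */
long count_reads_past_end(int width, int height, int block)
{
    assert(width > 0 && height > 0 && block > 0);
    const long total = (long)width * height;
    long past_end = 0;

    for (int gy = 0; gy < height; ++gy) {
        for (int gx = 0; gx < width; ++gx) {
            const long right    = (gx + block) + (long)gy * width;
            const long below    = gx + (long)(gy + block) * width;
            const long diagonal = (gx + block) + (long)(gy + block) * width;
            past_end += (right >= total) + (below >= total) + (diagonal >= total);
        }
    }
    return past_end;
}
```

Note that a read with gx + block >= width but a valid row does not leave the buffer; it wraps into the start of the next row, so it returns data from the wrong place rather than uninitialized memory.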