Coalesced access to global memory: block-wise access vs. element-wise access

I know there are many posts about coalesced memory access, but I have not found the information I need.

When transferring data from global memory to local memory, I know the accesses are coalesced as long as the threads of a half-warp read sequentially (the k-th thread reads the k-th word in a global memory segment).

But the NVIDIA OpenCL Best Practices Guide (page 13) states:

On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed; however, not all threads need to participate.

So, if I interpret the last sentence correctly, do we still have a coalesced access when, say, only the first 4 threads read the first 4 words of a segment in global memory?
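One way to read the quoted rule is the host-side C sketch below. It is only an interpretation of the sentence from the guide, not NVIDIA's actual hardware logic. The names `is_coalesced_cc1x` and `first_n_threads_coalesced` are made up for illustration, and the base address 1024 is an arbitrary aligned value chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16  /* threads per half-warp on these devices */

/* Hypothetical helper, not part of any NVIDIA or OpenCL API.
 * addr[k]   : byte address read by thread k of the half-warp
 * active[k] : nonzero if thread k takes part in the access
 * elem_size : size in bytes of one accessed element
 * Returns 1 if every active thread k reads word k of one segment that is
 * aligned to 16 * elem_size, else 0. */
int is_coalesced_cc1x(const uintptr_t addr[HALF_WARP],
                      const int active[HALF_WARP],
                      size_t elem_size)
{
    const uintptr_t seg_size = (uintptr_t)HALF_WARP * elem_size;
    uintptr_t seg_base = 0;
    int have_base = 0;

    for (int k = 0; k < HALF_WARP; ++k) {
        if (!active[k])
            continue;                    /* idle threads are allowed */
        uintptr_t offset = (uintptr_t)k * elem_size;
        if (addr[k] < offset)
            return 0;
        /* where the segment must start for thread k to hit word k */
        uintptr_t implied_base = addr[k] - offset;
        if (implied_base % seg_size != 0)
            return 0;                    /* segment is not aligned */
        if (have_base && implied_base != seg_base)
            return 0;                    /* threads hit different segments */
        seg_base = implied_base;
        have_base = 1;
    }
    return 1;
}

/* Only the first n threads are active; thread k reads float number
 * (k + shift) of a segment that starts at an assumed aligned address. */
int first_n_threads_coalesced(int n, int shift)
{
    uintptr_t addr[HALF_WARP] = {0};
    int active[HALF_WARP] = {0};
    const uintptr_t base = 1024;         /* multiple of 16 * sizeof(float) */

    for (int k = 0; k < n && k < HALF_WARP; ++k) {
        addr[k] = base + (uintptr_t)(k + shift) * sizeof(float);
        active[k] = 1;
    }
    return is_coalesced_cc1x(addr, active, sizeof(float));
}
```

Under this reading, `first_n_threads_coalesced(4, 0)` returns 1: four threads reading the first four words of an aligned segment still satisfy the rule. `first_n_threads_coalesced(4, 1)` returns 0, because thread k would then read word k + 1.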

I’m asking because I’m working on a kernel where I need to load different blocks of global memory into local memory.

...

	// loading current block into local memory
	int pos_in_loc = local_id_x + local_id_y * local_width * 2;
	int pos_in_inp = global_id_x + global_id_y * input_width;
	loc[pos_in_loc] = input_mat[pos_in_inp];

	// loading next block on X-Axis from global to local
	if (local_id_x < 2 * padding)
	{
		pos_in_loc = local_id_x + local_width + local_id_y * local_width * 2;
		pos_in_inp = global_id_x + local_width + global_id_y * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	// loading next block on Y-Axis from global to local
	if (local_id_y < 2 * padding)
	{
		pos_in_loc = local_id_x + (local_id_y + local_height) * local_width * 2;
		pos_in_inp = global_id_x + (local_height + global_id_y) * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	// loading next block on X/Y-Axis from global to local
	if (local_id_x < 2 * padding && local_id_y < 2 * padding)
	{
		pos_in_loc = local_id_x + local_width + (local_id_y + local_height) * local_width * 2;
		pos_in_inp = global_id_x + local_width + (local_height + global_id_y) * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	// wait until all work-items have finished writing to local memory
	barrier(CLK_LOCAL_MEM_FENCE);

...

loc = local memory (size = work_group_width * work_group_height * 4 * sizeof(float))

input_mat = global memory (512x512)

The kernel execution takes 3.46 ms on average. When I remove the if statements, loading is block-wise and therefore coalesced, and the average kernel execution time is about 3.21 ms.

Since the version with the if statements is only slightly slower, I assume that its reads are also coalesced and that the extra 0.25 ms is due to the if statements (block_size is 16).
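The index arithmetic behind that assumption can be checked on the host with a small C sketch. It assumes that a half-warp corresponds to one row of 16 work-items in a 16x16 work-group, that the elements are floats, and that the buffer itself starts on a segment boundary. None of this is confirmed by measurement; it only checks the indices. `halo_x_row_matches_rule` is a hypothetical name, and the padding value is whatever the kernel uses.

```c
/* Values taken from the post: 16x16 work-groups (block_size 16). */
enum { LOCAL_WIDTH = 16, SEGMENT_WORDS = 16 };

/* Index used by the "next block on X-Axis" load for one work-item. */
int halo_x_index(int global_id_x, int global_id_y, int input_width)
{
    return global_id_x + LOCAL_WIDTH + global_id_y * input_width;
}

/* Hypothetical check for one row of a work-group (assumed to be one
 * half-warp). Returns 1 if every work-item that passes the kernel's
 * "local_id_x < 2 * padding" test reads word local_id_x of a 16-word
 * segment, which is what the compute capability 1.0/1.1 rule asks for. */
int halo_x_row_matches_rule(int group_id_x, int global_id_y,
                            int padding, int input_width)
{
    for (int local_id_x = 0; local_id_x < LOCAL_WIDTH; ++local_id_x) {
        if (local_id_x >= 2 * padding)
            continue;   /* skipped by the if statement in the kernel */
        int global_id_x = group_id_x * LOCAL_WIDTH + local_id_x;
        int word = halo_x_index(global_id_x, global_id_y, input_width)
                   % SEGMENT_WORDS;
        if (word != local_id_x)
            return 0;
    }
    return 1;
}
```

With `input_width` = 512 the check passes for any row, because both the row stride and the extra `local_width` offset are multiples of 16. With a width that is not a multiple of 16, for example 500, it fails, so the result depends on the matrix width as much as on the if statements.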

Can block-wise reading cause any problems, given that there are no checks against the borders of input_mat?

Right now the last blocks in x and y access areas of global memory that are not initialized. I tried the code (with the ifs commented out) on ATI and NVIDIA cards, and it works fine on both.
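As far as I understand the OpenCL specification, reading outside a buffer is undefined behavior, so the fact that it runs on two cards does not guarantee that it is safe. One possible guard is to clamp the coordinates to the matrix edge before computing the index. The sketch below is plain C99 and should also compile as OpenCL C. `clamped_index` is a hypothetical helper, and a height parameter is needed that the kernel above does not have.

```c
/* Hypothetical helper: clamp (x, y) to the valid range of a
 * width x height matrix and return the linear index. */
int clamped_index(int x, int y, int width, int height)
{
    if (x < 0) x = 0;
    if (x > width - 1) x = width - 1;
    if (y < 0) y = 0;
    if (y > height - 1) y = height - 1;
    return x + y * width;
}

/* Possible use in the kernel (input_height is an assumed extra argument):
 *   pos_in_inp = clamped_index(global_id_x + local_width, global_id_y,
 *                              input_width, input_height);
 *   loc[pos_in_loc] = input_mat[pos_in_inp];
 */
```

Clamping makes several work-items in a border block read the same word, so those blocks may no longer satisfy the coalescing rule on compute capability 1.0 or 1.1 devices. Allocating the input buffer with extra padding rows and columns would be an alternative that keeps the block-wise access pattern unchanged.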