coalesced access to global memory block-wise access vs element-wise access

noxnet · March 17, 2010, 12:26pm

I know there are many posts about coalesced memory access, but I have not found the information I need.

When transferring data from global memory to local memory, I know the accesses are coalesced as long as the threads read sequentially (the k-th thread reads the k-th word in a global memory segment).

But as stated in the NVIDIA OpenCL Best Practices Guide (page 13):

On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed; however, not all threads need to participate.

So, if I interpret the last sentence correctly: if only, let's say, the first 4 threads read the first 4 words of a global memory segment, do we still have a coalesced access?

I'm asking because I'm working on a kernel where I need to load different blocks of global memory into local memory.
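To make the question concrete, the quoted sentence can be read as a predicate over the addresses touched by one half warp. The following is a minimal host-side C sketch of that reading only; it is not OpenCL code and does not model the actual hardware, and the name `half_warp_is_coalesced` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_WARP 16

/* Models only the quoted rule: every participating thread k must access
 * word k of one segment aligned to 16 * elem_size bytes.
 * addr[k]   : byte address accessed by thread k
 * active[k] : whether thread k participates in the access
 * Assumption: a half warp with no active threads counts as coalesced. */
bool half_warp_is_coalesced(const uint64_t addr[HALF_WARP],
                            const bool active[HALF_WARP],
                            size_t elem_size)
{
    const uint64_t seg_size = (uint64_t)HALF_WARP * elem_size;
    bool have_base = false;
    uint64_t base = 0;

    for (int k = 0; k < HALF_WARP; ++k) {
        if (!active[k])
            continue;                     /* inactive threads are ignored */

        uint64_t offset = (uint64_t)k * elem_size;
        if (addr[k] < offset)
            return false;                 /* cannot be word k of any segment */

        uint64_t implied_base = addr[k] - offset;
        if (implied_base % seg_size != 0)
            return false;                 /* segment is not aligned */
        if (have_base && implied_base != base)
            return false;                 /* threads hit different segments */

        base = implied_base;
        have_base = true;
    }
    return true;
}
```

Under this reading, a half warp in which only threads 0 to 3 are active and each reads its own word of an aligned segment satisfies the rule, while the same four threads reading shifted or permuted words do not.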

...
	//loading current block into local memory
	int pos_in_loc = local_id_x + local_id_y * local_width * 2;
	int pos_in_inp = global_id_x + global_id_y * input_width;
	loc[pos_in_loc] = input_mat[pos_in_inp];

	//loading next block on X-Axis from global to local
	if(local_id_x < 2*padding)
	{
		pos_in_loc = local_id_x + local_width + local_id_y * local_width * 2;
		pos_in_inp = global_id_x + local_width + global_id_y * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	//loading next block on Y-Axis from global to local
	if(local_id_y < 2*padding)
	{
		pos_in_loc = local_id_x + (local_id_y + local_height) * local_width * 2;
		pos_in_inp = global_id_x + (local_height + global_id_y) * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	//loading next block on X/Y-Axis from global to local
	if(local_id_x < 2*padding && local_id_y < 2*padding)
	{
		pos_in_loc = local_id_x + local_width + (local_id_y + local_height) * local_width * 2;
		pos_in_inp = global_id_x + local_width + (local_height + global_id_y) * input_width;
		loc[pos_in_loc] = input_mat[pos_in_inp];
	}

	barrier(CLK_LOCAL_MEM_FENCE);
...

loc = local memory (size = work_group_width * work_group_height * 4 * sizeof(float))

input_mat = global memory (512x512)
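The factor of 4 in the size of loc matches the row stride of `local_width * 2` used in the kernel: the local tile is twice the work-group size in each direction. A small plain C sketch of that layout follows, assuming the tile is exactly 2*local_width by 2*local_height floats; the helper names `tile_floats` and `tile_index` are hypothetical and not part of the kernel.

```c
#include <stddef.h>

/* Number of floats in the local tile, assuming it spans
 * 2 * local_width columns and 2 * local_height rows. */
size_t tile_floats(size_t local_width, size_t local_height)
{
    return (2 * local_width) * (2 * local_height);
}

/* Linear index of tile element (x, y), using the same row stride
 * (local_width * 2) as pos_in_loc in the kernel. */
size_t tile_index(size_t x, size_t y, size_t local_width)
{
    return x + y * (2 * local_width);
}
```

With a 16x16 work-group this gives 1024 floats, i.e. 4096 bytes of local memory, and the four loads in the kernel address the four 16x16 quadrants of that tile.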

Kernel execution takes 3.46 ms on average. When I remove the if statements, loading is block-wise and therefore coalesced, and the average kernel execution time is about 3.21 ms.

Since the version with the if statements is only slightly slower, I assume that its reads are also coalesced and that the 0.25 ms difference is due to the if statements themselves. (block_size 16)

Can block-wise reading cause any problems, given that there are no bounds checks on the input_mat borders?

Without the if statements, the last blocks in x and y access areas of global memory that are not initialized. I tried the code (with the ifs commented out) on ATI and NVIDIA cards, and it works fine on both.
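Working on two cards does not by itself show that reads past the end of input_mat are safe. One possible way to keep the unconditional block-wise loads while staying inside the buffer is to clamp the global coordinates before computing the index. The plain C sketch below illustrates the idea under the assumption that repeating the border values is acceptable for the algorithm, which the post does not state; `clamp_int` and `clamped_index` are hypothetical helpers, not part of the kernel.

```c
#include <stddef.h>

/* Restrict v to the closed range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Linear index into a width x height row-major matrix, with
 * out-of-range coordinates pulled back to the nearest border element. */
size_t clamped_index(int x, int y, int width, int height)
{
    int cx = clamp_int(x, 0, width - 1);
    int cy = clamp_int(y, 0, height - 1);
    return (size_t)cy * (size_t)width + (size_t)cx;
}
```

For the 512x512 input, a request for column 520 of row 515 would then resolve to the last element of the matrix rather than to memory beyond it. Interior work-items are unaffected, so their access pattern stays the same; only the border blocks end up with repeated addresses.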
