Coalesced access

Hello everyone, I have never had a deep understanding of coalesced access. Does it count as non-coalesced access if a single thread accesses non-contiguous memory locations? For example, in the following code, a thread needs to access non-contiguous locations of d_ini. Is this considered coalesced access or non-coalesced access?

__global__ void slope1Kernel(float *d_ini, float aa, float kappa)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    float phi, phix;
    if (i > 4 && i < imax + 4 && j > 4 && j < jmax + 4 && k > 4 && k < kmax + 4)
    {
        phi = d_ini[4 * nkmax*njmax*nimax + k * njmax*nimax + j * nimax + i];
        phix = d_ini[4 * nkmax*njmax*nimax + k * njmax*nimax + j * nimax + i + 1]
             + d_ini[4 * nkmax*njmax*nimax + k * njmax*nimax + j * nimax + i - 1]
             + d_ini[4 * nkmax*njmax*nimax + k * njmax*nimax + (j + 1) * nimax + i]
             + d_ini[4 * nkmax*njmax*nimax + k * njmax*nimax + (j - 1) * nimax + i]
             + d_ini[4 * nkmax*njmax*nimax + (k + 1) * njmax*nimax + j * nimax + i]
             + d_ini[4 * nkmax*njmax*nimax + (k - 1) * njmax*nimax + j * nimax + i]
             - 6.0f * phi;
        d_ini[5 * nkmax*njmax*nimax + k * njmax*nimax + j * nimax + i]
            = aa * phi * (phi - 1.0f) * (2.0f * phi - 1.0f) - kappa * phix;
    }
}

Coalescing happens between the threads of a warp, not within the accesses of a single thread.

So the accesses with + i as the last (fastest-varying) term of the index probably have generally good coalescing properties.

It also depends on alignment.

Generally you are throwing away bandwidth with your if clause.
Example with one dimension:
instead of
if (i > 4 && i < imax + 4) d_ini[i]
you can do
if (i < imax - 1) d_ini[i+5]
by reinterpreting i.
You should make sure that every thread within a warp participates.


Thank you for your response. Do you think it’s necessary to change a 3D grid into a 1D grid? Would this have a positive impact on memory access?

Hi 2249884324,
it depends a bit on the size of your dimensions.

Currently you have a 3D block size and a 3D grid size and combine the corresponding dimensions.

For your accesses it would make sense to let threadIdx.x use full warps (i.e. 0…31) or multiples.

So perhaps rather something like:

int i = threadIdx.x;
int j = blockIdx.x * blockDim.y + threadIdx.y;
int k = blockIdx.y;

and use the first block dimension for i, the second block dimension and first grid dimension for j and the second grid dimension for k.

And instead of letting i, j, k effectively start at 5, make them go from

0…imax-2, 0…jmax-2, 0…kmax-2

if (i < imax -1 && j < jmax - 1 && k < kmax - 1)
{
   phi = d_ini[4 * nkmax*njmax*nimax + (k+5) * njmax*nimax + (j+5) * nimax + (i+5)];
}

If you make nkmax, njmax and nimax compile-time constexpr constants or template parameters, the compiler can simplify the index calculation.


Thank you for your reply. You’ve provided me with a new indexing approach, and I will test it in my program later. However, this indexing method seems more suitable for cases with smaller dimensions. If imax > 1024, this indexing won’t be usable.

Yes, with imax - 1 larger than the maximum block size of 1024 threads (the limits of 1536 or 2048 concurrent threads per SM depend on the GPU) you cannot map the whole i range onto a single block dimension.

You can choose an indexing scheme that benefits caching by thinking about which data is needed by which indices and putting that data into the same block. But that is a secondary optimization, after getting coalescing right and using every thread (or as many as possible).

