Hello everyone, I have never fully understood coalesced access. Does it count as non-coalesced access if a thread accesses non-contiguous memory locations? For example, in the following code, if a thread needs to access non-contiguous elements of d_ini, is this considered coalesced or non-coalesced access?
__global__ void slope1Kernel(float *d_ini, float aa, float kappa)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    float phi, phix;
    if (i > 4 && i < imax + 4 && j > 4 && j < jmax + 4 && k > 4 && k < kmax + 4)
    {
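Coalescing is a property of a warp, not of a single thread: what matters is whether neighbouring threads of a warp access neighbouring addresses in the same instruction. So accesses with the + i at the end of the index probably coalesce well, because consecutive threads (consecutive threadIdx.x, hence consecutive i) then touch consecutive elements.
It also depends on alignment: if the first address a warp touches is not aligned to the memory transaction size, the warp needs an extra transaction.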
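To make the point about the + i term concrete, here is a minimal sketch (not the original kernel). It assumes a flat 3D array in which x is the fastest-varying dimension; `idx3`, `scaleCoalesced`, `scaleStrided` and `coalescingDemo` are hypothetical names introduced only for illustration. Both kernels compute the same result, and only the mapping from thread index to address differs.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Assumed layout (not from the thread): x (index i) varies fastest.
__host__ __device__ inline size_t idx3(int i, int j, int k, int nx, int ny)
{
    return (static_cast<size_t>(k) * ny + j) * nx + i;
}

// Coalesced: the index ends in "+ i", so threads with consecutive
// threadIdx.x access consecutive floats.
__global__ void scaleCoalesced(const float *in, float *out, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < nx && j < ny && k < nz) {
        size_t n = idx3(i, j, k, nx, ny);
        out[n] = 2.0f * in[n];
    }
}

// Strided: i is now the slowest index, so neighbouring threads of a warp
// are ny * nz floats apart and each one needs its own memory transaction.
__global__ void scaleStrided(const float *in, float *out, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < nx && j < ny && k < nz) {
        size_t n = (static_cast<size_t>(i) * ny + j) * nz + k;
        out[n] = 2.0f * in[n];
    }
}

// Hypothetical host helper: runs both kernels on an n * n * n cube and
// returns true if both give the same, correct result.
bool coalescingDemo(int n)
{
    assert(n > 0);
    size_t count = static_cast<size_t>(n) * n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_a(count), h_b(count);
    for (size_t t = 0; t < count; ++t) h_in[t] = static_cast<float>(t % 97);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 4, 2);   // 32 threads along x: one warp per row segment
    dim3 grid((n + 31) / 32, (n + 3) / 4, (n + 1) / 2);
    scaleCoalesced<<<grid, block>>>(d_in, d_a, n, n, n);
    scaleStrided<<<grid, block>>>(d_in, d_b, n, n, n);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    bool ok = true;
    for (size_t t = 0; t < count; ++t)
        ok = ok && h_a[t] == 2.0f * h_in[t] && h_b[t] == h_a[t];
    return ok;
}
```

Calling `coalescingDemo(40)` from `main` should return true for both variants. The results are identical, so the difference only shows up in run time or in the memory transaction counts of a profiler such as Nsight Compute.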
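More generally, you are throwing away bandwidth with your if clause: every thread that fails the test (for example those with i <= 4) sits idle, so the affected warps move less data than they could.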
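Example in one dimension: instead of
if (i > 4 && i < imax + 4) d_ini[i]
you can reinterpret i and write
if (i < imax - 1) d_ini[i + 5]
Both versions touch elements 5 through imax + 3, but in the second one threads 0 to 4 do useful work as well.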
Try to make every thread within a warp participate.
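The shifted index from the one-dimensional example could look like this as a complete kernel. This is only a sketch under assumptions: the update (`+= 1.0f`), the allocation size `imax + 8`, and the names `updateShifted` and `shiftedIndexDemo` are hypothetical and not taken from the original program.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Thread i updates element i + 5, so threads 0 .. imax-2 cover the same
// elements 5 .. imax+3 that "if (i > 4 && i < imax + 4)" would select.
__global__ void updateShifted(float *d_ini, int imax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < imax - 1)
        d_ini[i + 5] += 1.0f;   // placeholder update, assumed for illustration
}

// Hypothetical host helper: returns how many elements the kernel changed.
int shiftedIndexDemo(int imax)
{
    assert(imax >= 2);
    int n = imax + 8;           // assumed allocation size, not from the thread
    std::vector<float> h(n, 0.0f);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    // Launch only as many threads as there are elements to update;
    // at most the last block contains idle threads.
    int blocks = (imax - 1 + threads - 1) / threads;
    updateShifted<<<blocks, threads>>>(d, imax);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int changed = 0;
    for (int t = 0; t < n; ++t)
        if (h[t] == 1.0f) ++changed;
    return changed;
}
```

With `imax = 3000` the helper launches 12 blocks of 256 threads and should report 2999 changed elements. One trade-off to keep in mind: because of the + 5 shift, the first element a warp touches is no longer at the start of the array, which ties in with the alignment remark above.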
Thank you for your reply. You’ve provided me with a new indexing approach, and I will test it in my program later. However, this indexing method seems more suitable for cases with smaller dimensions. If imax > 1024, this indexing won’t be usable.
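Yes, with imax - 1 > 1024 you cannot fit all threads into a single block (1024 is the per-block limit; 1536 or 2048, depending on the GPU, is the number of threads that can be resident on one multiprocessor). The index i = blockIdx.x * blockDim.x + threadIdx.x spans all blocks of the grid, though, so the same scheme still works for larger imax if you launch more blocks.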
You can also choose an indexing scheme that benefits from caching: think about which data is needed by which indices and put those indices into the same block. But that is a secondary optimization, to be done after getting coalescing right and using every thread (or as many as possible).
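Putting these points together for three dimensions, a launch could be sketched as follows. The padded layout (each dimension has max + 8 elements, x fastest), the block shape, the placeholder update and the names `updateInterior3D` and `interiorDemo` are all assumptions made for illustration, not the original program.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Every thread that passes the test updates one interior element; the + 5
// shift is applied in all three dimensions. Assumed layout: x varies fastest.
__global__ void updateInterior3D(float *d_ini, int imax, int jmax, int kmax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < imax - 1 && j < jmax - 1 && k < kmax - 1) {
        int nx = imax + 8, ny = jmax + 8;   // assumed padded sizes
        size_t n = (static_cast<size_t>(k + 5) * ny + (j + 5)) * nx + (i + 5);
        d_ini[n] += 1.0f;                   // placeholder update
    }
}

// Hypothetical host helper: returns the number of updated elements.
long long interiorDemo(int imax, int jmax, int kmax)
{
    assert(imax >= 2 && jmax >= 2 && kmax >= 2);
    size_t count = static_cast<size_t>(imax + 8) * (jmax + 8) * (kmax + 8);
    std::vector<float> h(count, 0.0f);

    float *d = nullptr;
    cudaMalloc(&d, count * sizeof(float));
    cudaMemcpy(d, h.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // 32 threads along x, so each warp covers one contiguous row segment;
    // the 4 x 4 extent in y and z keeps neighbouring rows in the same block.
    dim3 block(32, 4, 4);
    dim3 grid((imax - 1 + block.x - 1) / block.x,
              (jmax - 1 + block.y - 1) / block.y,
              (kmax - 1 + block.z - 1) / block.z);
    updateInterior3D<<<grid, block>>>(d, imax, jmax, kmax);

    cudaMemcpy(h.data(), d, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    long long changed = 0;
    for (size_t t = 0; t < count; ++t)
        if (h[t] == 1.0f) ++changed;
    return changed;
}
```

For example, `interiorDemo(1500, 20, 10)` launches a 47 x 5 x 3 grid and should report 1499 * 19 * 9 updated elements, so an imax above 1024 is handled by the grid rather than by a larger block. Whether a 32 x 4 x 4 block is the best shape depends on the stencil and the GPU, so it is worth measuring.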