Help Avoiding Un-Coalesced Memory Access

rmonette · October 4, 2010, 4:27pm

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}

Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

rmonette · October 4, 2010, 4:27pm

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}

Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

rmonette · October 4, 2010, 5:31pm

Just realized I had a huge newbie mistake:

The kernel as originally presented would be setting values over and over even when the column or row was not correct. By adding a check to make sure the values are only read/wrote on the correct row and column this can be significantly sped up (i.e. not totally broken). :">

rmonette · October 4, 2010, 5:31pm

Just realized I had a huge newbie mistake:

The kernel as originally presented would be setting values over and over even when the column or row was not correct. By adding a check to make sure the values are only read/wrote on the correct row and column this can be significantly sped up (i.e. not totally broken). :">

jan.heckman · October 4, 2010, 5:31pm

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:
__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}
Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

You could try to first read (coalesced) to shared memory and then to do a (coalesced) write.

The reads and writes to, let us say, the vertical insides/edges would never be coalesced as far as I can see, so I would not spend code on copying from, say (1,i) to (0,i).

You might have to declare shared memory dynamically, see section B.2.3 op the programming guide.

Also I wonder what happens when writing your edgepoints? Could they be written to by their neighbours at more or less the same time?

jan.heckman · October 4, 2010, 5:31pm

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:
__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}
Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

You could try to first read (coalesced) to shared memory and then to do a (coalesced) write.

The reads and writes to, let us say, the vertical insides/edges would never be coalesced as far as I can see, so I would not spend code on copying from, say (1,i) to (0,i).

You might have to declare shared memory dynamically, see section B.2.3 op the programming guide.

Also I wonder what happens when writing your edgepoints? Could they be written to by their neighbours at more or less the same time?

rmonette · October 4, 2010, 6:57pm

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

		int j = threadIdx.y + blockIdx.y * blockDim.y;

	if (i == 0) x[IX(  0, i  )] = tex1D(tex_ref, IX(1,i));

	if (i == DIM-1) x[IX(DIM-1, i  )] = tex1D(tex_ref, IX(DIM-2,i));

	if (j == 0) x[IX(  i, 0  )] = tex1D(tex_ref, IX(i,1));

	if (j == DIM-1) x[IX(  i, DIM-1)] = tex1D(tex_ref, IX(i,DIM-2));

}

Before I invoke this kernel I set tex_ref (a 1D float texture reference) by calling bind to whatever global memory *x is referring to at the time.

Empirically, this is giving me a big speed boost in my specific application. Additionally, the if statements prevent anything from happening unless necessary. Having all that branching probably is a bit of a speed hit, so Ill look into your idea(s) about how this could be done in a more elegant way.

I am still getting uncoalesced writes however, honestly I dont even know where to begin on fixing that and would appreciate any guidance. Thanks again!

rmonette · October 4, 2010, 6:57pm

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

		int j = threadIdx.y + blockIdx.y * blockDim.y;

	if (i == 0) x[IX(  0, i  )] = tex1D(tex_ref, IX(1,i));

	if (i == DIM-1) x[IX(DIM-1, i  )] = tex1D(tex_ref, IX(DIM-2,i));

	if (j == 0) x[IX(  i, 0  )] = tex1D(tex_ref, IX(i,1));

	if (j == DIM-1) x[IX(  i, DIM-1)] = tex1D(tex_ref, IX(i,DIM-2));

}

Before I invoke this kernel I set tex_ref (a 1D float texture reference) by calling bind to whatever global memory *x is referring to at the time.

Empirically, this is giving me a big speed boost in my specific application. Additionally, the if statements prevent anything from happening unless necessary. Having all that branching probably is a bit of a speed hit, so Ill look into your idea(s) about how this could be done in a more elegant way.

I am still getting uncoalesced writes however, honestly I dont even know where to begin on fixing that and would appreciate any guidance. Thanks again!

jan.heckman · October 4, 2010, 8:56pm

[quote name=‘rmonette’ post=‘1126283’ date=‘Oct 4 2010, 08:57 PM’]

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

[codebox]for (i=0;i<DIM;i++)

for (j=0;j<DIM;j++)

{

if (i==0) ....

else if (i==DIM-1) .....

if (j==0) ...

else if (j==DIM-1) ....

}[/codebox]

Probably you would just work with the inner loops:

[codebox]for (i=0;i<DIM;i++)

{

x(0,i)=x(1,i)

x(DIM-1,i)=x(DIM-2,i)

}

for (j=0;j<DIM;j++)

{

x(j,0)=x(j,1)

x(j,DIM-1)=x(j,DIM-2)

}[/codebox]

Assuming a square array, I would use DIM threads, and divide these threads into blocks so that each warp (32 thread) reads 256 bytes which are stored in shared memory, and then written to global memory again. This is repeated and then the second loop is done without shared memory.

So:

[codebox]global void doborders(float *x)

{

shared float s_x[32];

int ind=blockIdx.x*gridDim.x+threadIdx.x;

s_x=x[DIM+ind];

__syncthreads();

x[ind]=s_x;

// similar for bottom

ind+=(DIM-2)*DIM;

s_x=x[ind];

__syncthreads();

x[DIM+ind]=s_x;

// column 0…

if (ind==0 || ind==DIM-1) return;

x[indDIM]=x[indDIM+1];

x[indDIM+DIM-1]=x[indDIM+DIM-2];

}

void calldoborders(float* x)

{

…

doborders<<<DIM/32,32>>>(x);

…

}

[/codebox]

The use of shared memory should take care of coalescing, in so far possible.

I haven’t tried this out, just typed it in the codeboxes.

Also, I have no idea if this does better than your texture approach.

If you try it, please tell me how it works out.

jan.heckman · October 4, 2010, 8:56pm

[quote name=‘rmonette’ post=‘1126283’ date=‘Oct 4 2010, 08:57 PM’]

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

[codebox]for (i=0;i<DIM;i++)

for (j=0;j<DIM;j++)

{

if (i==0) ....

else if (i==DIM-1) .....

if (j==0) ...

else if (j==DIM-1) ....

}[/codebox]

Probably you would just work with the inner loops:

[codebox]for (i=0;i<DIM;i++)

{

x(0,i)=x(1,i)

x(DIM-1,i)=x(DIM-2,i)

}

for (j=0;j<DIM;j++)

{

x(j,0)=x(j,1)

x(j,DIM-1)=x(j,DIM-2)

}[/codebox]

Assuming a square array, I would use DIM threads, and divide these threads into blocks so that each warp (32 thread) reads 256 bytes which are stored in shared memory, and then written to global memory again. This is repeated and then the second loop is done without shared memory.

So:

[codebox]global void doborders(float *x)

{

shared float s_x[32];

int ind=blockIdx.x*gridDim.x+threadIdx.x;

s_x=x[DIM+ind];

__syncthreads();

x[ind]=s_x;

// similar for bottom

ind+=(DIM-2)*DIM;

s_x=x[ind];

__syncthreads();

x[DIM+ind]=s_x;

// column 0…

if (ind==0 || ind==DIM-1) return;

x[indDIM]=x[indDIM+1];

x[indDIM+DIM-1]=x[indDIM+DIM-2];

}

void calldoborders(float* x)

{

…

doborders<<<DIM/32,32>>>(x);

…

}

[/codebox]

The use of shared memory should take care of coalescing, in so far possible.

I haven’t tried this out, just typed it in the codeboxes.

Also, I have no idea if this does better than your texture approach.

If you try it, please tell me how it works out.

Topic		Replies	Views
Need some help to understand how to coalesce memory access CUDA Programming and Performance	4	1050	June 30, 2010
Coalesced memory access pattern in Image processing CUDA Programming and Performance	2	1007	March 22, 2013
Shared memory question CUDA Programming and Performance	27	7597	June 23, 2008
Memory coalescing in one thread CUDA Programming and Performance	17	16769	March 31, 2011
performance unchanged by shared memory 1D heat equation attempt CUDA Programming and Performance	15	5001	July 12, 2009
Problems with coalescing memory accesses CUDA Programming and Performance	4	3818	August 26, 2009
Coalesced shared memory access? Read and write from which thread to which? CUDA Programming and Performance	23	2674	May 22, 2014
Coalescing memory accesses Need help with coalescing CUDA Programming and Performance	2	1224	March 30, 2009
Coalesced memory access example CUDA Programming and Performance	2	3346	March 28, 2011
How to resolve this Coalescing problem? CUDA Programming and Performance	11	2288	May 28, 2009

Help Avoiding Un-Coalesced Memory Access

Related topics