Help Avoiding Un-Coalesced Memory Access

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}

Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

I am looking for some suggestions on good programming patterns in CUDA for avoiding un-coalesced memory access.

The kernel I am trying to optimize is as follows:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

	int j = threadIdx.y + blockIdx.y * blockDim.y;

	x[IX(  0, i  )] = x[IX(1, i)];

	x[IX(DIM-1, i  )] = x[IX(DIM-2, i)];

	x[IX(  i, 0  )] = x[IX(i, 1)];

	x[IX(  i, DIM-1)] = x[IX(i, DIM-2)];

}

Basically, I am setting my boundary to the value just inside. Using the Compute Visual Profiler I have determined this is a major hot-spot in my program. I understand that I am making four global memory accesses which are both slow and un-cached. One option I am pursuing is trying to make my accesses tex1Dfetch’s so at least they would be cached. However, then I cannot pass in a pointer which makes my kernel significantly less re-usable. I could put an if statement to select from my textures but then I am going to incur branching penalties.

Just wondering if the pro’s might be able to recommend what I could do in a situation such as this?

Thanks!

Just realized I had a huge newbie mistake:

The kernel as originally presented would be setting values over and over even when the column or row was not correct. By adding a check to make sure the values are only read/wrote on the correct row and column this can be significantly sped up (i.e. not totally broken). :">

Just realized I had a huge newbie mistake:

The kernel as originally presented would be setting values over and over even when the column or row was not correct. By adding a check to make sure the values are only read/wrote on the correct row and column this can be significantly sped up (i.e. not totally broken). :">

You could try to first read (coalesced) to shared memory and then to do a (coalesced) write.

The reads and writes to, let us say, the vertical insides/edges would never be coalesced as far as I can see, so I would not spend code on copying from, say (1,i) to (0,i).

You might have to declare shared memory dynamically, see section B.2.3 op the programming guide.

Also I wonder what happens when writing your edgepoints? Could they be written to by their neighbours at more or less the same time?

You could try to first read (coalesced) to shared memory and then to do a (coalesced) write.

The reads and writes to, let us say, the vertical insides/edges would never be coalesced as far as I can see, so I would not spend code on copying from, say (1,i) to (0,i).

You might have to declare shared memory dynamically, see section B.2.3 op the programming guide.

Also I wonder what happens when writing your edgepoints? Could they be written to by their neighbours at more or less the same time?

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

		int j = threadIdx.y + blockIdx.y * blockDim.y;

	if (i == 0) x[IX(  0, i  )] = tex1D(tex_ref, IX(1,i));

	if (i == DIM-1) x[IX(DIM-1, i  )] = tex1D(tex_ref, IX(DIM-2,i));

	if (j == 0) x[IX(  i, 0  )] = tex1D(tex_ref, IX(i,1));

	if (j == DIM-1) x[IX(  i, DIM-1)] = tex1D(tex_ref, IX(i,DIM-2));

}

Before I invoke this kernel I set tex_ref (a 1D float texture reference) by calling bind to whatever global memory *x is referring to at the time.

Empirically, this is giving me a big speed boost in my specific application. Additionally, the if statements prevent anything from happening unless necessary. Having all that branching probably is a bit of a speed hit, so Ill look into your idea(s) about how this could be done in a more elegant way.

I am still getting uncoalesced writes however, honestly I dont even know where to begin on fixing that and would appreciate any guidance. Thanks again!

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

__global__ void set_bnd_kernel_0(float *x)

{

	int i = threadIdx.x + blockIdx.x * blockDim.x;

		int j = threadIdx.y + blockIdx.y * blockDim.y;

	if (i == 0) x[IX(  0, i  )] = tex1D(tex_ref, IX(1,i));

	if (i == DIM-1) x[IX(DIM-1, i  )] = tex1D(tex_ref, IX(DIM-2,i));

	if (j == 0) x[IX(  i, 0  )] = tex1D(tex_ref, IX(i,1));

	if (j == DIM-1) x[IX(  i, DIM-1)] = tex1D(tex_ref, IX(i,DIM-2));

}

Before I invoke this kernel I set tex_ref (a 1D float texture reference) by calling bind to whatever global memory *x is referring to at the time.

Empirically, this is giving me a big speed boost in my specific application. Additionally, the if statements prevent anything from happening unless necessary. Having all that branching probably is a bit of a speed hit, so Ill look into your idea(s) about how this could be done in a more elegant way.

I am still getting uncoalesced writes however, honestly I dont even know where to begin on fixing that and would appreciate any guidance. Thanks again!

[quote name=‘rmonette’ post=‘1126283’ date=‘Oct 4 2010, 08:57 PM’]

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

[codebox]for (i=0;i<DIM;i++)

for (j=0;j<DIM;j++)

{

if (i==0) ....

else if (i==DIM-1) .....

if (j==0) ...

else if (j==DIM-1) ....

}[/codebox]

Probably you would just work with the inner loops:

[codebox]for (i=0;i<DIM;i++)

{

x(0,i)=x(1,i)

x(DIM-1,i)=x(DIM-2,i)

}

for (j=0;j<DIM;j++)

{

x(j,0)=x(j,1)

x(j,DIM-1)=x(j,DIM-2)

}[/codebox]

Assuming a square array, I would use DIM threads, and divide these threads into blocks so that each warp (32 thread) reads 256 bytes which are stored in shared memory, and then written to global memory again. This is repeated and then the second loop is done without shared memory.

So:

[codebox]global void doborders(float *x)

{

shared float s_x[32];

int ind=blockIdx.x*gridDim.x+threadIdx.x;

s_x=x[DIM+ind];

__syncthreads();

x[ind]=s_x;

// similar for bottom

ind+=(DIM-2)*DIM;

s_x=x[ind];

__syncthreads();

x[DIM+ind]=s_x;

// column 0…

if (ind==0 || ind==DIM-1) return;

x[indDIM]=x[indDIM+1];

x[indDIM+DIM-1]=x[indDIM+DIM-2];

}

void calldoborders(float* x)

{

doborders<<<DIM/32,32>>>(x);

}

[/codebox]

The use of shared memory should take care of coalescing, in so far possible.

I haven’t tried this out, just typed it in the codeboxes.

Also, I have no idea if this does better than your texture approach.

If you try it, please tell me how it works out.

[quote name=‘rmonette’ post=‘1126283’ date=‘Oct 4 2010, 08:57 PM’]

So I’ve been able to get rid of “gld uncoalesced” hits in Computer Visual Profiler with the following:

[codebox]for (i=0;i<DIM;i++)

for (j=0;j<DIM;j++)

{

if (i==0) ....

else if (i==DIM-1) .....

if (j==0) ...

else if (j==DIM-1) ....

}[/codebox]

Probably you would just work with the inner loops:

[codebox]for (i=0;i<DIM;i++)

{

x(0,i)=x(1,i)

x(DIM-1,i)=x(DIM-2,i)

}

for (j=0;j<DIM;j++)

{

x(j,0)=x(j,1)

x(j,DIM-1)=x(j,DIM-2)

}[/codebox]

Assuming a square array, I would use DIM threads, and divide these threads into blocks so that each warp (32 thread) reads 256 bytes which are stored in shared memory, and then written to global memory again. This is repeated and then the second loop is done without shared memory.

So:

[codebox]global void doborders(float *x)

{

shared float s_x[32];

int ind=blockIdx.x*gridDim.x+threadIdx.x;

s_x=x[DIM+ind];

__syncthreads();

x[ind]=s_x;

// similar for bottom

ind+=(DIM-2)*DIM;

s_x=x[ind];

__syncthreads();

x[DIM+ind]=s_x;

// column 0…

if (ind==0 || ind==DIM-1) return;

x[indDIM]=x[indDIM+1];

x[indDIM+DIM-1]=x[indDIM+DIM-2];

}

void calldoborders(float* x)

{

doborders<<<DIM/32,32>>>(x);

}

[/codebox]

The use of shared memory should take care of coalescing, in so far possible.

I haven’t tried this out, just typed it in the codeboxes.

Also, I have no idea if this does better than your texture approach.

If you try it, please tell me how it works out.