Thread conflicts in stencil computations

Hi,

I have an array VEC in global memory and the following operation needs to be performed:

VEC[i] = VEC[i-1] + VEC[i+1] for i = 1 to N-2

We can write 2 kernels to perform the above operation:
__global__ void kernel1( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        temp[TID] = VEC[TID-1] + VEC[TID+1];
}

__global__ void kernel2( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        VEC[TID] = temp[TID];
}

Calling kernel1 and then kernel2 consecutively yields the desired result, but it's more expensive because two kernels are launched.
I would just like to use one kernel like this:

__global__ void kernel3( float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    // Option 1
    // This won't work and might give different results each time the kernel is
    // called: some threads may modify a value while a neighboring thread reads it.
    VEC[TID] = VEC[TID-1] + VEC[TID+1];

    // Option 2
    // Try copying to a register first. Works sometimes, but the reads are still
    // unsynchronized with other threads' writes, so it is not reliable either.
    float temp = VEC[TID-1] + VEC[TID+1];
    VEC[TID] = temp;
}

Is there some other way, using a single kernel like kernel3, that will not give rise to thread conflicts? Any ideas would be appreciated.

-DC

err, why not do something like this:

__global__ void kernel1( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        temp[TID] = VEC[TID-1] + VEC[TID+1];
}

.....

kernel1 <<<blocks, threadsperblock>>> (temp, VEC, N);

// swap the pointers, so VEC now names the updated data
float *swap = VEC; VEC = temp; temp = swap;
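
For completeness, a fuller (untested) host-side sketch of that ping-pong pattern over several steps; the step kernel, launch configuration, and boundary handling below are my own assumptions, not code from this thread:

#include <cuda_runtime.h>

// One stencil step: read from 'in', write to 'out'.
__global__ void step( float *out, const float *in, int N )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if ( tid >= 1 && tid < N-1 )
        out[tid] = in[tid-1] + in[tid+1];
    else if ( tid == 0 || tid == N-1 )
        out[tid] = in[tid];  // carry the untouched boundary values across
}

void run_steps( float *d_vec, float *d_tmp, int N, int nsteps )
{
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;

    for ( int s = 0; s < nsteps; ++s )
    {
        step<<<blocks, threads>>>( d_tmp, d_vec, N );

        // Pointer swap: d_vec always names the most recent data.
        float *swap = d_vec; d_vec = d_tmp; d_tmp = swap;
    }
    cudaDeviceSynchronize();  // wait for the last step to finish
}

Note that after an odd number of steps the result lives in the buffer originally passed as d_tmp, so the caller needs to track the swap (or the function could return the final pointer).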

Each thread handles one element: it copies both neighboring values into registers or shared memory, then calls __syncthreads() so that the writing of the summed results waits until all reading has been done. This works as long as all the needed threads fit in one block; if not, you might need __threadfence(), AFAIK. I'm not sure this is worth writing a kernel for, but if your data are already on the GPU, then why not?
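
Roughly like this (an untested sketch; note it only works when the whole array is covered by a single block, i.e. N <= blockDim.x):

// In-place update using __syncthreads(). Single block only: the barrier
// synchronizes threads within one block, so N must not exceed blockDim.x.
__global__ void kernel_inplace( float *VEC, int N )
{
    int tid = threadIdx.x;

    float sum = 0.0f;
    if ( tid >= 1 && tid < N-1 )
        sum = VEC[tid-1] + VEC[tid+1];  // read phase

    __syncthreads();  // every thread finishes reading before anyone writes

    if ( tid >= 1 && tid < N-1 )
        VEC[tid] = sum;  // write phase
}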

Thanks, it did the job :) There is definitely an improvement in speed…

-DC

If you use texture memory for VEC, you may get still better performance.
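
Something along these lines, perhaps (untested; this uses the texture reference API, which was standard at the time but has since been deprecated in favor of texture objects):

// 1D texture reference bound to the linear array VEC.
texture<float, 1, cudaReadModeElementType> texVEC;

__global__ void kernel_tex( float *temp, int N )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if ( tid >= 1 && tid < N-1 )
        temp[tid] = tex1Dfetch( texVEC, tid-1 ) + tex1Dfetch( texVEC, tid+1 );
}

// Host side: bind, launch, unbind. d_vec/d_tmp are the device buffers.
cudaBindTexture( 0, texVEC, d_vec, N * sizeof(float) );
kernel_tex<<<blocks, threads>>>( d_tmp, N );
cudaUnbindTexture( texVEC );

The reads go through the texture cache while the writes go to the separate buffer d_tmp, so the texture cache's lack of coherence with global writes within a launch is not a problem here.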

Raghu

You’re right for compute capability 1.3 and earlier, but Fermi has automatic L1 caching of global loads, so textures probably won’t help there.

It’d be interesting to measure though.
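
A minimal (untested) way to time the two variants with CUDA events; the kernel name and launch configuration below are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

void time_kernel( float *d_tmp, float *d_vec, int N, int blocks, int threads )
{
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    kernel1<<<blocks, threads>>>( d_tmp, d_vec, N );  // variant under test
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "kernel time: %.3f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}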
