Thread conflicts in stencil computations

Hi,

I have an array VEC in global memory and the following operation needs to be performed:

VEC[i] = VEC[i-1] + VEC[i+1] for i = 1 to N-2

We can write 2 kernels to perform the above operation:
__global__ void kernel1( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        temp[TID] = VEC[TID-1] + VEC[TID+1];
}

__global__ void kernel2( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        VEC[TID] = temp[TID];
}

Calling kernel1 and then kernel2 consecutively yields the desired result, but it's more expensive because two kernels are launched.
I would just like to use one kernel like this:

__global__ void kernel3( float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    // Option 1
    // This won't work and might give different results each time the kernel is
    // called: some threads may modify a value while a neighboring thread reads it.
    VEC[TID] = VEC[TID-1] + VEC[TID+1];

    // Option 2
    // Try copying to a register first. Works sometimes, but the reads are still
    // unsynchronized with other threads' writes, so it is not reliable either.
    float temp = VEC[TID-1] + VEC[TID+1];
    VEC[TID] = temp;
}

Is there some other way, using a single kernel like kernel3, that will not give rise to thread conflicts? Any ideas would be appreciated.

-DC

err, why not do something like this:

__global__ void kernel1( float *temp, float *VEC, int N )
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index

    if ( TID >= 1 && TID < N-1 )
        temp[TID] = VEC[TID-1] + VEC[TID+1];
}

.....

kernel1 <<<blocks, threadsperblock>>> (temp, VEC, N);

// swap the pointers, so VEC now names the updated data
float *swap = VEC; VEC = temp; temp = swap;
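
For completeness, a fuller (untested) host-side sketch of that ping-pong pattern over several steps; the step kernel, launch configuration, and boundary handling below are my own assumptions, not code from this thread:

#include <cuda_runtime.h>

// One stencil step: read from 'in', write to 'out'.
__global__ void step( float *out, const float *in, int N )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if ( tid >= 1 && tid < N-1 )
        out[tid] = in[tid-1] + in[tid+1];
    else if ( tid == 0 || tid == N-1 )
        out[tid] = in[tid];  // carry the untouched boundary values across
}

void run_steps( float *d_vec, float *d_tmp, int N, int nsteps )
{
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;

    for ( int s = 0; s < nsteps; ++s )
    {
        step<<<blocks, threads>>>( d_tmp, d_vec, N );

        // Pointer swap: d_vec always names the most recent data.
        float *swap = d_vec; d_vec = d_tmp; d_tmp = swap;
    }
    cudaDeviceSynchronize();  // wait for the last step to finish
}

Note that after an odd number of steps the result lives in the buffer originally passed as d_tmp, so the caller needs to track the swap (or the function could return the final pointer).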

Each thread handles one element: it copies both neighboring values into registers or shared memory, then calls __syncthreads() so that the writing of the summed results waits until all reading has been done. This works as long as all the needed threads fit in one block; if not, you might need __threadfence(), AFAIK. I'm not sure this is worth writing a kernel for, but if your data are already on the GPU, then why not?
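
Roughly like this (an untested sketch; note it only works when the whole array is covered by a single block, i.e. N <= blockDim.x):

// In-place update using __syncthreads(). Single block only: the barrier
// synchronizes threads within one block, so N must not exceed blockDim.x.
__global__ void kernel_inplace( float *VEC, int N )
{
    int tid = threadIdx.x;

    float sum = 0.0f;
    if ( tid >= 1 && tid < N-1 )
        sum = VEC[tid-1] + VEC[tid+1];  // read phase

    __syncthreads();  // every thread finishes reading before anyone writes

    if ( tid >= 1 && tid < N-1 )
        VEC[tid] = sum;  // write phase
}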

Thanks, it did the job :) There is definitely an improvement in speed…

-DC

If you use texture memory for VEC, you may get still better performance.
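
Something along these lines, perhaps (untested; this uses the texture reference API, which was standard at the time but has since been deprecated in favor of texture objects):

// 1D texture reference bound to the linear array VEC.
texture<float, 1, cudaReadModeElementType> texVEC;

__global__ void kernel_tex( float *temp, int N )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if ( tid >= 1 && tid < N-1 )
        temp[tid] = tex1Dfetch( texVEC, tid-1 ) + tex1Dfetch( texVEC, tid+1 );
}

// Host side: bind, launch, unbind. d_vec/d_tmp are the device buffers.
cudaBindTexture( 0, texVEC, d_vec, N * sizeof(float) );
kernel_tex<<<blocks, threads>>>( d_tmp, N );
cudaUnbindTexture( texVEC );

The reads go through the texture cache while the writes go to the separate buffer d_tmp, so the texture cache's lack of coherence with global writes within a launch is not a problem here.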

Raghu

You’re right for compute capability 1.3 and earlier, but Fermi has automatic L1 caching of global loads, so textures probably won’t help there.

It’d be interesting to measure though.
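
A minimal (untested) way to time the two variants with CUDA events; the kernel name and launch configuration below are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

void time_kernel( float *d_tmp, float *d_vec, int N, int blocks, int threads )
{
    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start, 0 );
    kernel1<<<blocks, threads>>>( d_tmp, d_vec, N );  // variant under test
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "kernel time: %.3f ms\n", ms );

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
}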
