I have an array VEC in global memory and the following operation needs to be performed:
VEC[i] = VEC[i-1] + VEC[i+1] for i = 1 to N-2
We can write two kernels to perform the above operation:
__global__ void kernel1(float* temp, float* VEC, int N)
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (TID >= 1 && TID < N-1)
        temp[TID] = VEC[TID-1] + VEC[TID+1];
}
__global__ void kernel2(float* temp, float* VEC, int N)
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;
    if (TID >= 1 && TID < N-1)
        VEC[TID] = temp[TID];
}
Calling kernel1 and then kernel2 consecutively yields the desired result, but it is more expensive since two kernels are launched and an extra temp array is needed.
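For reference, a minimal host-side sketch of the two-kernel approach (the block/grid sizes and the device pointers d_temp and d_VEC are illustrative assumptions):

int threads = 256;
int blocks  = (N + threads - 1) / threads;
// kernel1 writes all the sums into the scratch array first...
kernel1<<<blocks, threads>>>(d_temp, d_VEC, N);
// ...and kernel2 copies them back, so no thread ever reads a value
// that another thread has already overwritten.
kernel2<<<blocks, threads>>>(d_temp, d_VEC, N);
cudaDeviceSynchronize();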
I would just like to use one kernel like this:
__global__ void kernel3(float* VEC, int N)
{
    int TID = blockIdx.x * blockDim.x + threadIdx.x;
    // Option 1
    // This won't work and might give different results each time the kernel is called:
    // some threads modify values that neighbouring threads are still reading.
    VEC[TID] = VEC[TID-1] + VEC[TID+1];

    // Option 2
    // Try copying to a register first. Works sometimes, but the race on the
    // reads of VEC[TID-1] and VEC[TID+1] is still there.
    float temp = VEC[TID-1] + VEC[TID+1];
    VEC[TID] = temp;
}
Is there any other way to do this in a single kernel that will not give rise to the thread conflicts seen in kernel3? Any ideas will be appreciated.
Have each thread handle one element: it copies both adjacent elements into automatic (register) or shared variables, then calls __syncthreads() to make sure every thread has finished reading before any thread writes its summed result back. This works as long as all the needed threads fit in one block. If not, you might need __threadfence(), AFAIK. I'm not sure this is worth writing a kernel for, but if your data are already on the GPU, then why not?
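A minimal sketch of that single-block idea (assuming N is no larger than the maximum block size, so one block covers the whole array; the kernel name is just illustrative):

__global__ void kernel_one_block(float* VEC, int N)
{
    int TID = threadIdx.x;                 // single block, so threadIdx.x is the global index
    float left = 0.0f, right = 0.0f;
    if (TID >= 1 && TID < N-1) {
        left  = VEC[TID-1];                // read both neighbours first...
        right = VEC[TID+1];
    }
    __syncthreads();                       // ...wait until every thread in the block has read...
    if (TID >= 1 && TID < N-1)
        VEC[TID] = left + right;           // ...then it is safe to overwrite
}

// launched as, e.g.: kernel_one_block<<<1, N>>>(d_VEC, N);

Once more than one block is needed, __syncthreads() no longer helps across blocks, and the two-kernel (or double-buffered) version is the straightforward safe fallback.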