kelson
1
Hi,
I would like to know whether something is wrong with the following program.
I’m trying to do a simple scan, but I can’t get the correct result.
__global__ void reduceKernel1(float* _iarray, float* _oarray, int N)
{
    //shared memory array
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    //each thread load data to shared memory
    if(i < blockDim.x){
        sdata[tid] = _iarray[i];
    }
    //synchronization
    __syncthreads();
    //loop over sdata
    for(int j = 1; j < blockDim.x; j *= 2)
    {
        if(tid % (2*j) == 0)
        {
            sdata[tid] += sdata[tid + j];
            __syncthreads();
        }
        __syncthreads();
    }
    //write back the data to the output array
    if(tid < ARRAYSIZE)
        _oarray[i] = sdata[tid];
    __syncthreads();
}
Thank you for your help.
syoon
2
In my experience, how the shared memory size is set when calling the kernel is a common source of errors.
kelson
4
Sir syoon,
Thank you for your reply.
Actually, I’m trying not to set the shared memory size. This is why it is declared with the extern keyword.
I don’t know whether I have misunderstood some principles of CUDA.
[codebox]
if(tid % (2*j) == 0)
{
sdata[tid] += sdata[tid + j] ;
__syncthreads();
}
__syncthreads();[/codebox]
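Having a __syncthreads() within an “if” block whose condition doesn’t evaluate the same for all threads in the block can hang the kernel or have unexpected side effects. The __syncthreads() after the “if” block is enough; the one inside it should be removed.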
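As an illustration of the point above, here is a minimal sketch of the same interleaved reduction with the barrier hoisted out of the branch. The names `reduceSketch` and `sumOnGpu` are hypothetical and not from the original post. The sketch also makes two assumptions beyond what the thread discusses: the load is guarded by the element count `n` rather than `blockDim.x`, and the read of `sdata[tid + j]` is bounds-checked so the block size need not be a power of two. Error checking is omitted and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-block sum reduction; block b writes its partial sum to blockSums[b].
__global__ void reduceSketch(const float* in, float* blockSums, int n)
{
    extern __shared__ float sdata[];   // size is supplied at launch time

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the input load 0 so they do not affect the sum.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int j = 1; j < blockDim.x; j *= 2) {
        if (tid % (2 * j) == 0 && tid + j < blockDim.x) {
            sdata[tid] += sdata[tid + j];
        }
        // The barrier sits outside the branch, so every thread reaches it.
        __syncthreads();
    }

    if (tid == 0) {
        blockSums[blockIdx.x] = sdata[0];
    }
}

// Hypothetical host helper: returns the sum of a non-empty vector.
float sumOnGpu(const std::vector<float>& data, int threads = 256)
{
    int n = static_cast<int>(data.size());
    int blocks = (n + threads - 1) / threads;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reduceSketch<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    // Finish the reduction over the per-block partial sums on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With this sketch, `sumOnGpu(std::vector<float>(1000, 1.0f))` should return 1000, since every partial sum is exactly representable as a float.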
syoon
8
I think you already know more than I do…Just in case…I had some problems with shared memory before…
The following is from ‘tera’ on my question on shared memory usage…
==============
the size must be known at compile time.
What you can do is use the dynamic shared memory allocation feature. Declare a_d as follows:
extern __shared__ float a_d[];
and add the required size as the third configuration parameter of the kernel invocation:
my_kernel<<<gridsize, blocksize, blocksize.x*blocksize.y*blocksize.z*sizeof(float)>>>();
Note that this only works for one variable size array.
==============
If you are sure you did this part right, please disregard my reply.
With tera’s help I am now doing fine using dynamic shared memory…
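On the note that this only works for one variable-size array: a common workaround is to request one buffer large enough for everything and carve it up with pointer arithmetic inside the kernel. The sketch below is only an illustration under stated assumptions. The names `twoArraysSketch` and `neighbourFlags` are hypothetical, the input is assumed to fit in a single block, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One extern array backs two logical arrays:
// nFloats floats, followed by one int per thread.
__global__ void twoArraysSketch(const float* in, int* out, int nFloats)
{
    extern __shared__ float smem[];    // the single dynamic allocation
    float* values = smem;              // first nFloats elements
    // float and int are both 4 bytes here, so the int region stays aligned.
    // With mixed sizes, place the most strictly aligned type first.
    int* flags = reinterpret_cast<int*>(smem + nFloats);

    int tid = threadIdx.x;
    values[tid] = in[tid];
    flags[tid] = (values[tid] > 0.0f) ? 1 : 0;
    __syncthreads();

    // Read a neighbour's flag to show the data really is shared.
    out[tid] = flags[(tid + 1) % blockDim.x];
}

// Hypothetical host helper: one block, one thread per element.
std::vector<int> neighbourFlags(const std::vector<float>& data)
{
    int n = static_cast<int>(data.size());

    float* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch size must cover both logical arrays.
    size_t bytes = n * sizeof(float) + n * sizeof(int);
    twoArraysSketch<<<1, n, bytes>>>(dIn, dOut, n);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The third launch parameter is a byte count for the whole block, so it has to be the sum of all the logical arrays, not the size of any one of them.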
tera
10
How do you invoke the kernel - can you post that line of code as well?
kelson
12
Hi sirs,
Thanks for your help.
You are completely right. I declared the shared memory as extern but
forgot to pass the required size in the kernel call.
before: reduceKernel1 <<< blocks, threads >>> ();
after: reduceKernel1 <<< blocks, threads, ARRAYSIZE * sizeof(float) >>> ();
Now everything works normally.
Thank you!