Hi all,
After studying the reduction sample in the SDK, I tried to write a simple program that loads a small image from texture memory into shared memory, then reads the pixels back from shared memory to generate a new (identical) image.
Now I have some trouble with thread management in the kernel:
I get different results each time I run the program.
Here is the kernel code:
template <class T> __global__ void CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
    SharedMemory<T> l_smem;
    T *sdata_l = l_smem.getPointer();
    int smemSize = (imgW*imgH)*sizeof(T);
    const int ix = threadIdx.x;
    const int iy = blockIdx.x;
    const float x = (float)ix + 0.5f;
    const float y = (float)iy + 0.5f;
    //each thread loads one element from texture mem to shared mem
    float4 clr00 = tex2D(texImage, x, y);
    sdata_l[imgW * iy + ix] = clr00.x;
    __syncthreads();
    //Rewriting pixels to dst
    if(ix < imgW && iy < imgH){
        //d_dst[imgW * iy + ix] = make_color1(clr00.x);
        d_dst[imgW * iy + ix] = sdata_l[imgW * iy + ix];
    }
}

template <class T> void cuda_CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
    dim3 threads(imgW, 1, 1);
    dim3 grid(imgH, 1, 1);
    CostCal<T><<<grid, threads>>>(d_dst, d_src, imgW, imgH);
}

extern "C" void cuda_CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
    cuda_CostCal<int>(d_dst, d_src, imgW, imgH);
}
If I run
d_dst[imgW * iy + ix] = make_color1(clr00.x);
without the copy from texture memory to shared memory, I get the result I want:
[image: correct result]
If I use the code as posted above, I get varying wrong results:
[images: several different incorrect results]
Can anyone tell me why? How can I fix my code? Thanks!
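
A likely cause, sketched under assumptions: shared memory is private to each block, and a dynamically sized `extern __shared__` array (which is what the SDK's `SharedMemory<T>` wraps) only gets as many bytes as the third launch parameter requests. The launch above passes no size and the kernel indexes shared memory with the global pixel index `imgW * iy + ix`, so every block except possibly the first writes out of bounds. That would fit results that change from run to run. The sketch below keeps the same layout (one block per row, one thread per pixel) but indexes shared memory by `threadIdx.x` and requests one row of shared memory per block. The names `copyRowViaShared` and `copyImageViaShared` are hypothetical, `float` pixels stand in for `TColor`, and a plain global-memory read stands in for the `tex2D` fetch so the example is self-contained.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for CostCal: one block per image row,
// one thread per pixel of that row.
__global__ void copyRowViaShared(float *d_dst, const float *d_src,
                                 int imgW, int imgH)
{
    // Dynamic shared memory; its size is set by the third launch parameter.
    extern __shared__ float row[];

    const int ix = threadIdx.x;  // column within the row
    const int iy = blockIdx.x;   // row of the image
    const bool inside = (ix < imgW && iy < imgH);

    // Shared memory belongs to this block only, so index it with the
    // thread index, not with the global pixel index.
    // (Assumption: in the original kernel the tex2D fetch would go here.)
    if (inside)
        row[ix] = d_src[imgW * iy + ix];

    // Every thread of the block reaches the barrier, in or out of bounds.
    __syncthreads();

    if (inside)
        d_dst[imgW * iy + ix] = row[ix];
}

// Hypothetical host wrapper. Assumes imgW does not exceed the device's
// maximum number of threads per block.
cudaError_t copyImageViaShared(float *d_dst, const float *d_src,
                               int imgW, int imgH)
{
    const dim3 threads(imgW, 1, 1);
    const dim3 grid(imgH, 1, 1);
    const size_t smemBytes = imgW * sizeof(float);  // one row per block

    copyRowViaShared<<<grid, threads, smemBytes>>>(d_dst, d_src, imgW, imgH);

    // Report launch errors first, then errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();
}
```

In the original code the equivalent change would be to write `sdata_l[ix]` instead of `sdata_l[imgW * iy + ix]` and to launch with `<<<grid, threads, imgW * sizeof(T)>>>`. Checking the value returned by `cudaGetLastError()` after the launch should also show whether the kernel failed.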