Loading from texture mem to shared mem: trouble with the threads

Hi all,

After studying the reduction example in the SDK, I tried to write a simple program that loads a small image from texture memory into shared memory and then reads the pixels back from shared memory to generate a new (identical) image.

Now I have some trouble with thread management in the kernel:

I get different results on every run.

Here is the kernel code:

template <class T> __global__ void CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
	SharedMemory<T> l_smem;
	T *sdata_l = l_smem.getPointer();
	int smemSize = (imgW*imgH)*sizeof(T);

	const int ix = threadIdx.x;
	const int iy = blockIdx.x;
	const float x = (float)ix + 0.5f;
	const float y = (float)iy + 0.5f;

	//each thread loads one element from texture mem to shared mem
	float4 clr00 = tex2D(texImage, x, y);
	sdata_l[imgW * iy + ix] = clr00.x;

	//Rewriting pixels to dst
	if(ix < imgW && iy < imgH){
		//d_dst[imgW * iy + ix] = make_color1(clr00.x);
		d_dst[imgW * iy + ix] = sdata_l[imgW * iy + ix];
	}
}

template <class T> void cuda_CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
	dim3 threads(imgW, 1, 1);
	dim3 grid(imgH, 1, 1);
	CostCal<T><<<grid, threads>>>(d_dst, d_src, imgW, imgH);
}

extern "C" void cuda_CostCal(TColor *d_dst, TColor *d_src, int imgW, int imgH){
	cuda_CostCal<int>(d_dst, d_src, imgW, imgH);
}

If I run

d_dst[imgW * iy + ix] = make_color1(clr00.x);

without the copy from texture memory to shared memory, I get the result I want.

If I use the code as shown above, I get faulty results that change between runs.

Can anyone tell me why? How can I correct my code? Thanks!

  1. Where do you allocate your shared memory? I don’t see any allocation anywhere.

  2. Shared memory is allocated per block, so in your case you would most likely want to allocate imgW*bytesPerImageElement bytes of shared memory for each block. This memory is then usually indexed using only threadIdx (in your case sdata_l[threadIdx.x]). Indexing it with imgW * iy + ix, as your kernel does, goes outside the block’s allocation, which would explain results that change from run to run. Note that the maximum amount of shared memory is 16 kB per block, so it is actually impossible to allocate enough shared memory for your whole image.
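To make the two points above concrete, here is a minimal sketch of a per-row copy through shared memory. It is not the original poster's code: the kernel name copyRowViaShared is made up for illustration, a plain global-memory float array stands in for the tex2D fetch so the example is self-contained, and it assumes imgW fits in one thread block. The shared buffer is sized by the third launch parameter and indexed with threadIdx.x only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one block per image row, one thread per pixel in that row.
// ASSUMPTION: src is a plain global array standing in for the texture fetch.
__global__ void copyRowViaShared(float *dst, const float *src, int imgW, int imgH)
{
    // Sized at launch by the third <<<...>>> argument; private to each block.
    extern __shared__ float row[];

    const int ix = threadIdx.x;
    const int iy = blockIdx.x;
    const bool inside = (ix < imgW && iy < imgH);

    if (inside)
        row[ix] = src[imgW * iy + ix];   // shared memory indexed by threadIdx only

    // Not strictly needed while each thread reads back its own element, but
    // required as soon as a thread reads an element written by another thread.
    __syncthreads();

    if (inside)
        dst[imgW * iy + ix] = row[ix];
}

int main()
{
    const int imgW = 256, imgH = 64;     // illustrative sizes
    const size_t bytes = size_t(imgW) * imgH * sizeof(float);

    std::vector<float> h_src(imgW * imgH), h_dst(imgW * imgH, -1.0f);
    for (int i = 0; i < imgW * imgH; ++i) h_src[i] = float(i % 251);

    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);
    cudaMemcpy(d_src, h_src.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block (one row).
    copyRowViaShared<<<imgH, imgW, imgW * sizeof(float)>>>(d_dst, d_src, imgW, imgH);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_dst.data(), d_dst, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < imgW * imgH; ++i)
        if (h_dst[i] != h_src[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_src);
    cudaFree(d_dst);
    return mismatches != 0;
}
```

Each block needs only imgW * sizeof(float) bytes here (1 kB for a 256-pixel row), which stays well under the per-block limit mentioned above, whereas a buffer for the whole image would not.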

Yes, you are right. I had made a mistake with the shared memory. Now I get my result with the following. Thanks for your help!

__shared__ float4 sData_l[256];

sData_l[ix] = tex2D(texImage, ix, l);