Simple Thread Problem

I have a simple CUDA code that will eventually operate on an array of vectors, finding the vector that is the average of the array. I am having trouble getting the threads to operate properly on the GPU with my programming scheme. I have pasted an example code below.

What I am doing here is defining a 16x16 block of threads. Each thread steps through one 16-element vector and increments an accumulator for each element. I have set up the accumulator in shared memory to speed things up. When the code finishes, each element of the shared array should contain 16x16 = 256.

When I run the code in Emulation mode this is exactly what I get. However, when I run the code on the GPU I get only 1.0’s in each bin. If I place the __syncthreads() function at different locations I get random 2’s and 3’s scattered in the list, but never the right answer. What am I doing wrong here? My GPU is the Quadro FX 1600M.

// following definitions are made in the calling function:
typedef struct {
int height;
float* elements;
} Vector;

Vector vect;
vect.height = 16;
vect.elements = (float *)malloc( vect.height * sizeof(float) );

void AvgSpectFunc(const Matrix Cube, Vector vect)
{
    Vector d_vect;
    d_vect.height = vect.height;

    size_t sizeVect = d_vect.height * sizeof(float);
    cudaMalloc((void**) &d_vect.elements, sizeVect);

    // Invoke a very simple kernel with 16x16 threads

    dim3 dimBlock( 16,16 );
    dim3 dimGrid( 1,1 );

    TestKernel<<<dimGrid, dimBlock>>>(d_Cube, d_vect); // note: d_Cube is not used here

    // Show the results:
    cudaMemcpy(vect.elements, d_vect.elements, sizeVect, cudaMemcpyDeviceToHost);

    printf("vect return:\n" );
    for( int j=0; j<16; j++ ) printf("%.1f \n", vect.elements[j] );
    printf("\n" );

    // Free device memory
    cudaFree(d_vect.elements);
}

// Kernel Function:

__global__ void TestKernel(const Matrix, Vector);

__global__ void TestKernel(const Matrix Cube, Vector vect)
{
    __shared__ float sVect[16];

    for(int k=0; k<16; k++) sVect[k] = 0;

    int ii = threadIdx.x;
    int jj = threadIdx.y;
    for (int ww=0; ww<vect.height; ww++)
        sVect[ww] += 1;	// generic thread

    //At end of all threads write shared memory to the return memory
    for(int k=0; k<vect.height; k++) vect.elements[k] = sVect[k];
}


I didn't look too deeply at the algorithm, but your usage of shared memory is totally wrong. I suggest you re-read the programming guide or some tutorials.

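Basically, the line sVect[ww] += 1 (and others) makes all threads in the block race to update the same position in the array. The += is a read, an add and a write, not one atomic step, so many threads read the same old value and most of the increments are lost. What ends up in the shared memory array is a corrupted value.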

The reason the code works in emulation is that in emulation mode the threads actually run serially, one after the other, so the race never shows up.
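
For what it's worth, here is a minimal sketch of one race-free way to get 256 in every bin. It is not the only fix and not necessarily the fastest one. Each thread writes its contribution into its own shared memory slot, and the block then sums the slots with a tree reduction, using only __syncthreads() and no atomics. The names CountKernel, runCountKernel, partial, VEC_LEN and BLOCK_THREADS are made up for this example, and the Matrix argument from the original is dropped because it was unused.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define VEC_LEN 16          // elements per vector (vect.height in the post)
#define BLOCK_THREADS 256   // 16 x 16 threads, must be a power of two

// Hypothetical replacement for TestKernel: every thread contributes 1 to
// every element, but no two threads ever write the same shared slot.
__global__ void CountKernel(float* out)
{
    __shared__ float partial[BLOCK_THREADS];

    // Flatten the 2D thread index into 0..255.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    for (int w = 0; w < VEC_LEN; ++w) {
        // Each thread stores its own contribution to element w.
        partial[tid] = 1.0f;
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (int stride = BLOCK_THREADS / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();   // reached by all threads, outside the if
        }

        // Only one thread writes the total for element w to global memory.
        if (tid == 0)
            out[w] = partial[0];
    }
}

// Hypothetical host helper: launches one 16x16 block and copies back.
void runCountKernel(float* h_out)
{
    float* d_out = NULL;
    size_t bytes = VEC_LEN * sizeof(float);
    cudaMalloc((void**)&d_out, bytes);

    dim3 dimBlock(16, 16);
    dim3 dimGrid(1, 1);
    CountKernel<<<dimGrid, dimBlock>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main()
{
    float h_out[VEC_LEN];
    runCountKernel(h_out);
    for (int j = 0; j < VEC_LEN; ++j)
        printf("%.1f\n", h_out[j]);
    return 0;
}
```

On the GPU this should print 256.0 sixteen times, the same as emulation mode. For the real averaging problem, the 1.0f would become the value each thread loads from its input vector, and the final write would divide by the number of vectors.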