I have a simple CUDA program that will eventually operate on an array of vectors and compute the vector that is the average of the array. I am having trouble getting the threads to behave properly on the GPU with my programming scheme, and I have pasted example code below.

Here I define a single 16x16 block of threads. Each thread steps through one 16-element vector and increments an accumulator for each element. The accumulators live in shared memory to speed things up. When the kernel finishes, each element of the shared array should contain 16x16 = 256.

In Emulation mode this is exactly what I get. On the GPU, however, I get only 1.0's in each bin. If I place the __syncthreads() call at different locations I get random 2's and 3's scattered through the list, but never the right answer. What am I doing wrong here? My GPU is a Quadro FX 1600M.

// The following definitions are made in the calling function:
typedef struct {
    int height;
    float* elements;
} Vector;

Vector vect;
vect.height = 16;
vect.elements = (float *)malloc( vect.height * sizeof(float) );

void AvgSpectFunc(const Matrix Cube, Vector vect)
{
    Vector d_vect;
    d_vect.height = vect.height;
    size_t sizeVect = d_vect.height * sizeof(float);
    cudaMalloc((void**) &d_vect.elements, sizeVect);

    // Invoke a very simple kernel with 16x16 threads
    dim3 dimBlock( 16,16 );
    dim3 dimGrid( 1,1 );
    TestKernel<<<dimGrid, dimBlock>>>(d_Cube, d_vect); // note: d_Cube is not used here

    // Show the results:
    cudaMemcpy(vect.elements, d_vect.elements, sizeVect, cudaMemcpyDeviceToHost);
    printf("vect return:\n" );
    for( int j=0; j<16; j++ ) printf("%.1f \n", vect.elements[j] );
    printf("\n" );

    // Free device memory
    cudaFree(d_vect.elements);
}

// Kernel Function:
__global__ void TestKernel(const Matrix, Vector);

__global__ void TestKernel(const Matrix Cube, Vector vect)
{
    __shared__ float sVect[16];
    for(int k=0; k<16; k++) sVect[k] = 0;
    __syncthreads();

    int ii = threadIdx.x;
    int jj = threadIdx.y;

    for (int ww=0; ww<vect.height; ww++)
    {
        sVect[ww] += 1; // generic thread
    }
    __syncthreads();

    // At end of all threads write shared memory to the return memory
    for(int k=0; k<vect.height; k++) vect.elements[k] = sVect[k];
}
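
For comparison, here is a minimal sketch of one way to structure the same test without any two threads updating the same shared location at once. It assumes the wrong counts come from all 256 threads doing an unsynchronized read-modify-write on the same `sVect[ww]`, which `__syncthreads()` alone does not serialize. Instead of a shared accumulator, each thread writes its contribution to its own slot and the slots are combined with a tree reduction. The names `SumKernel`, `runSumKernel`, `VEC_LEN`, `BLOCK_DIM` and `NUM_THREADS` are hypothetical and not part of the code above, and the constant `1.0f` stands in for whatever value a real thread would contribute.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

#define VEC_LEN     16                        // elements per vector, as in the post
#define BLOCK_DIM   16                        // 16x16 thread block, as in the post
#define NUM_THREADS (BLOCK_DIM * BLOCK_DIM)   // 256 threads in the block

// Hypothetical kernel: every thread contributes 1.0 to each element, and the
// contributions are combined with a shared-memory tree reduction.
__global__ void SumKernel(float* out, int height)
{
    __shared__ float partial[NUM_THREADS];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flat index 0..255

    for (int ww = 0; ww < height; ww++) {
        partial[tid] = 1.0f;   // each thread writes only its own slot
        __syncthreads();

        // Halve the number of active threads each step. An active thread adds
        // one other slot into its own, so no two threads write the same slot.
        for (int stride = NUM_THREADS / 2; stride > 0; stride /= 2) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0) out[ww] = partial[0];  // one thread stores the total
        __syncthreads();       // finish this element before slots are reused
    }
}

// Hypothetical host helper: runs the kernel and copies VEC_LEN sums into
// host_out. Returns the error code from cudaMalloc if that fails, otherwise
// the code returned by cudaMemcpy.
cudaError_t runSumKernel(float* host_out)
{
    float* d_out = NULL;
    size_t bytes = VEC_LEN * sizeof(float);
    cudaError_t err = cudaMalloc((void**)&d_out, bytes);
    if (err != cudaSuccess) return err;

    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    dim3 dimGrid(1, 1);
    SumKernel<<<dimGrid, dimBlock>>>(d_out, VEC_LEN);

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(host_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

Calling `runSumKernel(vect.elements)` from the host and printing the 16 values with the same `printf` loop should show 256.0 in every bin if the race is indeed the cause. The sketch avoids atomic operations on purpose, since I am not sure an older part such as the Quadro FX 1600M supports them on shared memory.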