CUDA shared memory within thread blocks problem

I posted a question earlier, but it was a bit complicated, so I broke it down to the simplest failing case I could manage. Below is a CUDA program that takes a number of blocks and a number of threads per block. Each thread in a block writes a 1 to its own slot in shared memory. Then the threads wait at a barrier. Thread 0 then iterates over shared memory, adds up all the ones, and writes the sum to global memory for its block. So in the end the program should print out the number of threads in each block. Simple enough, right? Well, it works for any number of threads in a block < 8. For any number of threads per block greater than 8 it only prints out 8. It has to be something simple, but I've yet to figure it out. I am also relatively new to CUDA. Below are the code and the Makefile.

test.cu:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda.h>

__global__ void checkMatchOnDevice(int * blockArray)
{
  extern __shared__ int sdata[];
  unsigned int tid=threadIdx.x;
  sdata[tid] = 1;
  __syncthreads();

  if (tid==0)
  {
    for (unsigned int a=0;a<blockDim.x;a++)
      blockArray[blockIdx.x] += sdata[a];
  }
}

int main(int argc, char *argv[])
{
  int nBlocks;
  int threadsPerBlock;
  if(argc != 3)
  {
    printf("Usage: numBlocks threadsPerBlock\n");
    exit(0);
  }
  else
  {
    nBlocks = atoi(argv[1]);
    threadsPerBlock = atoi(argv[2]);
    int *blockArray;
    int *hostBlockArray;
    cudaMalloc((void **) &blockArray, sizeof(int)*nBlocks);
    hostBlockArray = (int*)malloc(sizeof(int)*nBlocks);
    for(int i = 0; i < nBlocks; i++)
    {
      hostBlockArray[i] = 0;
    }
    cudaMemcpy(blockArray, hostBlockArray, sizeof(int)*nBlocks, cudaMemcpyHostToDevice);
    checkMatchOnDevice <<< nBlocks, threadsPerBlock >>> (blockArray);
    cudaThreadSynchronize();
    cudaMemcpy(hostBlockArray, blockArray, sizeof(int)*nBlocks, cudaMemcpyDeviceToHost);
    //cudaMemcpy(hostMatchStartArray, matchStartArray, sizeof(int)*numThreads, cudaMemcpyDeviceToHost);
    //cudaMemcpy(hostMatchEndArray, matchEndArray, sizeof(int)*numThreads, cudaMemcpyDeviceToHost);
    int total2=0;
    for(int i = 0; i < nBlocks; i++)
    {
      total2 += hostBlockArray[i];
      printf("%d)%d\n",i,hostBlockArray[i]);
    }
    printf("Total: %d\n", total2);
  }
}

Makefile:

################################################################################
#
# Copyright 1993-2006 NVIDIA Corporation.  All rights reserved.
#
# NOTICE TO USER:
#
# This source code is subject to NVIDIA ownership rights under U.S. and
# international Copyright laws.
#
# NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
# CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
# IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
# REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
# IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
# OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
# OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
# OR PERFORMANCE OF THIS SOURCE CODE.
#
# U.S. Government End Users.  This source code is a "commercial item" as
# that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting  of
# "commercial computer software" and "commercial computer software
# documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
# and is provided to the U.S. Government only as a commercial end item.
# Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
# 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
# source code with only those rights set forth herein.
#
################################################################################
#
# Build script for project
#
################################################################################

# Add source files here
EXECUTABLE	:= test
# Cuda source files (compiled with cudacc)
CUFILES		:= test.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES		:=

################################################################################
# Rules and targets

include ../../common/common.mk

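One thing that immediately sticks out is that you do not allocate any memory for sdata. Because it is declared extern __shared__, its size is not known at compile time and has to be supplied at launch, in bytes, as the third parameter of the execution configuration. Call your kernel like this: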

checkMatchOnDevice <<< nBlocks, threadsPerBlock , sizeof(int) * threadsPerBlock >>> (blockArray);
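For reference, here is a minimal sketch of the whole test program with that launch fix applied. A few things in it are my own assumptions and not from the code posted above: the CUDA_CHECK macro is a hypothetical helper for reporting runtime errors, cudaMemset replaces the host-side zeroing loop, cudaDeviceSynchronize is used in place of the deprecated cudaThreadSynchronize, and thread 0 sums into a local variable before writing the result once to global memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): abort with a message
// if a CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,            \
              cudaGetErrorString(err_));                            \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

__global__ void checkMatchOnDevice(int *blockArray)
{
  // Dynamically sized shared array: its size in bytes comes from the
  // third parameter of the <<< >>> launch configuration.
  extern __shared__ int sdata[];

  unsigned int tid = threadIdx.x;
  sdata[tid] = 1;
  __syncthreads();  // all writes must be visible before thread 0 reads

  if (tid == 0) {
    int sum = 0;
    for (unsigned int a = 0; a < blockDim.x; a++)
      sum += sdata[a];
    blockArray[blockIdx.x] = sum;  // one global write per block
  }
}

int main(int argc, char *argv[])
{
  if (argc != 3) {
    printf("Usage: %s numBlocks threadsPerBlock\n", argv[0]);
    return 1;
  }
  int nBlocks = atoi(argv[1]);
  int threadsPerBlock = atoi(argv[2]);
  if (nBlocks <= 0 || threadsPerBlock <= 0) {
    fprintf(stderr, "Both arguments must be positive integers\n");
    return 1;
  }

  size_t bytes = sizeof(int) * nBlocks;
  int *hostBlockArray = (int *)malloc(bytes);
  if (hostBlockArray == NULL) {
    fprintf(stderr, "Host allocation failed\n");
    return 1;
  }

  int *blockArray = NULL;
  CUDA_CHECK(cudaMalloc((void **)&blockArray, bytes));
  CUDA_CHECK(cudaMemset(blockArray, 0, bytes));

  // One int of shared memory per thread in the block.
  size_t sharedBytes = sizeof(int) * threadsPerBlock;
  checkMatchOnDevice<<<nBlocks, threadsPerBlock, sharedBytes>>>(blockArray);
  CUDA_CHECK(cudaGetLastError());       // reports an invalid launch configuration
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

  CUDA_CHECK(cudaMemcpy(hostBlockArray, blockArray, bytes,
                        cudaMemcpyDeviceToHost));

  int total = 0;
  for (int i = 0; i < nBlocks; i++) {
    total += hostBlockArray[i];
    printf("%d)%d\n", i, hostBlockArray[i]);
  }
  printf("Total: %d\n", total);

  free(hostBlockArray);
  CUDA_CHECK(cudaFree(blockArray));
  return 0;
}
```

Assuming a working GPU and a block size within the device limit, running this as "./test 2 16" should print 16 for each of the two blocks and a total of 32. Checking cudaGetLastError right after the launch is also a cheap way to notice when the requested block size or shared memory size is larger than the device supports.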

Wow… I knew it had to be something simple. Thanks so much. I owe you a beer.