CUDA shared memory within thread blocks problem

Jerrkawz · February 23, 2011, 9:03pm

I posted a question earlier, but it was a bit complicated, so I broke it down to the simplest failing case I could manage. Below is a CUDA program that takes a number of blocks and a number of threads per block. Each thread in a block writes a 1 to its own slot in shared memory. Then the threads wait at a barrier. Thread 0 then iterates over shared memory, adds up all the ones, and writes the sum to global memory for its block. So in the end the program should print out the number of threads in each block. Simple enough, right? Well, it works for any number of threads per block < 8. For any number of threads per block greater than 8, it only prints out 8. It has to be something simple, but I've yet to figure it out. I am also relatively new to CUDA. Below are the code and the Makefile.

test.cu:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda.h>

__global__ void checkMatchOnDevice(int * blockArray)
{
  extern __shared__ int sdata[];
  unsigned int tid=threadIdx.x;
  sdata[tid] = 1;
  __syncthreads();

  if (tid==0)
  {
    for (unsigned int a=0;a<blockDim.x;a++)
      blockArray[blockIdx.x] += sdata[a];
  }
}

int main(int argc, char *argv[])
{
  int nBlocks;
  int threadsPerBlock;
  if(argc != 3)
  {
    printf("Usage: numBlocks threadsPerBlock\n");
    exit(0);
  }
  else
  {
    nBlocks = atoi(argv[1]);
    threadsPerBlock = atoi(argv[2]);
    int *blockArray;
    int *hostBlockArray;
    cudaMalloc((void **) &blockArray, sizeof(int)*nBlocks);
    hostBlockArray = (int*)malloc(sizeof(int)*nBlocks);
    for(int i = 0; i < nBlocks; i++)
    {
      hostBlockArray[i] = 0;
    }
    cudaMemcpy(blockArray, hostBlockArray, sizeof(int)*nBlocks, cudaMemcpyHostToDevice);
    checkMatchOnDevice <<< nBlocks, threadsPerBlock >>> (blockArray);
    cudaThreadSynchronize();
    cudaMemcpy(hostBlockArray, blockArray, sizeof(int)*nBlocks, cudaMemcpyDeviceToHost);
    //cudaMemcpy(hostMatchStartArray, matchStartArray, sizeof(int)*numThreads, cudaMemcpyDeviceToHost);
    //cudaMemcpy(hostMatchEndArray, matchEndArray, sizeof(int)*numThreads, cudaMemcpyDeviceToHost);
    int total2=0;
    for(int i = 0; i < nBlocks; i++)
    {
      total2 += hostBlockArray[i];
      printf("%d)%d\n",i,hostBlockArray[i]);
    }
    printf("Total: %d\n", total2);
  }
}

Makefile:

################################################################################
#
# Copyright 1993-2006 NVIDIA Corporation.  All rights reserved.
#
# NOTICE TO USER:
#
# This source code is subject to NVIDIA ownership rights under U.S. and
# international Copyright laws.
#
# NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
# CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
# IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
# REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
# IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
# OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
# OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
# OR PERFORMANCE OF THIS SOURCE CODE.
#
# U.S. Government End Users.  This source code is a "commercial item" as
# that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting  of
# "commercial computer software" and "commercial computer software
# documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
# and is provided to the U.S. Government only as a commercial end item.
# Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
# 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
# source code with only those rights set forth herein.
#
################################################################################
#
# Build script for project
#
################################################################################

# Add source files here
EXECUTABLE	:= test
# Cuda source files (compiled with cudacc)
CUFILES		:= test.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES		:=

################################################################################
# Rules and targets

include ../../common/common.mk

tera · February 24, 2011, 12:37am

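One thing that immediately sticks out is that you do not allocate any memory for sdata. An extern __shared__ array has no size of its own; its size in bytes has to be passed as the third kernel launch parameter. Call your kernel like this: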

checkMatchOnDevice <<< nBlocks, threadsPerBlock , sizeof(int) * threadsPerBlock >>> (blockArray);
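For reference, here is a minimal sketch of how the whole program might look with that change applied. Several things in it are my own assumptions and not part of the original post: the CHECK macro and the countPerBlock helper are hypothetical names, cudaMemset stands in for the host-side zeroing loop, cudaDeviceSynchronize replaces the older cudaThreadSynchronize, and thread 0 sums into a local variable before doing a single global write. Treat it as one possible version, not the definitive fix.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): stop on any CUDA error.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                    cudaGetErrorString(err_), __LINE__);         \
            exit(1);                                             \
        }                                                        \
    } while (0)

__global__ void checkMatchOnDevice(int *blockArray)
{
    // No size here: it comes from the third launch parameter.
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    sdata[tid] = 1;
    __syncthreads();   // every thread's write is visible past this point

    if (tid == 0) {
        int sum = 0;
        for (unsigned int a = 0; a < blockDim.x; a++)
            sum += sdata[a];
        blockArray[blockIdx.x] = sum;   // one global write per block
    }
}

// Hypothetical wrapper: runs the kernel and copies one count per block
// into hostOut, which must hold at least nBlocks ints.
void countPerBlock(int nBlocks, int threadsPerBlock, int *hostOut)
{
    size_t resultBytes = sizeof(int) * nBlocks;
    size_t sharedBytes = sizeof(int) * threadsPerBlock;  // one int per thread
    int *blockArray = NULL;

    CHECK(cudaMalloc((void **)&blockArray, resultBytes));
    CHECK(cudaMemset(blockArray, 0, resultBytes));

    checkMatchOnDevice<<<nBlocks, threadsPerBlock, sharedBytes>>>(blockArray);
    CHECK(cudaGetLastError());        // reports launch configuration errors
    CHECK(cudaDeviceSynchronize());   // reports errors raised during execution

    CHECK(cudaMemcpy(hostOut, blockArray, resultBytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(blockArray));
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        printf("Usage: numBlocks threadsPerBlock\n");
        return 1;
    }
    int nBlocks = atoi(argv[1]);
    int threadsPerBlock = atoi(argv[2]);
    if (nBlocks <= 0 || threadsPerBlock <= 0) {
        printf("Both arguments must be positive integers\n");
        return 1;
    }

    int *hostBlockArray = (int *)malloc(sizeof(int) * nBlocks);
    if (hostBlockArray == NULL)
        return 1;

    countPerBlock(nBlocks, threadsPerBlock, hostBlockArray);

    int total = 0;
    for (int i = 0; i < nBlocks; i++) {
        total += hostBlockArray[i];
        printf("%d)%d\n", i, hostBlockArray[i]);
    }
    printf("Total: %d\n", total);

    free(hostBlockArray);
    return 0;
}
```

The error checks after the launch are there so that a bad launch configuration or a failure during kernel execution is reported instead of silently leaving stale values in the result array.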

Jerrkawz · February 24, 2011, 4:04am

Wow… I knew it had to be something simple. Thanks so much. I owe you a beer
