atomicAdd and shared memory issue Running the histogram code from "Cuda by example" book.

Hello everyone!

I’m on a MacBook Pro with 9400M chip and CUDA 4 installed and working.

I’m having troubles running the histogram example from “CUDA by example” book.

After some test I’ve find out that the problem occur when the code is modified to perform atomicAdd on shared memory.

Here is the issued kernel code:

__global__ void histo_kernel( unsigned char *buffer,

                                long size,

                                unsigned int *histo ) {

int i = threadIdx.x + blockIdx.x * blockDim.x;

   int stride = blockDim.x * gridDim.x;

    __shared__ unsigned int temp[256];

    temp[threadIdx.x] = 0;

    __syncthreads();

while (i < size) {

        atomicAdd( &(temp[buffer[i]]), 1 );

        i += stride;

    }

    __syncthreads();

    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );                  

}

Everithing works fine if I use the atomicAdd on the global memory like this:

__global__ void histo_kernel( unsigned char *buffer,

                                long size,

                                unsigned int *histo ) {

int i = threadIdx.x + blockIdx.x * blockDim.x;

   int stride = blockDim.x * gridDim.x;

while (i < size) {

        atomicAdd( &(histo[buffer[i]]), 1 );

        i += stride;

    }                 

}

But the shared memory version returns an empty array :unsure:

There is one more thing, and is quite scary.

I’ve tried to find out what the problem was by compiling with -g -G flags and running cuda-gdb,

resulting in a complete system freeze when entering the kernel scope :confused:

Any ideas?

Thanks in advance!

These are my GPU features:

Device 0: "GeForce 9400M"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:                 254 MBytes (265945088 bytes)

  ( 2) Multiprocessors x ( 8) CUDA Cores/MP:     16 CUDA Cores

  GPU Clock Speed:                               1.10 GHz

  Memory Clock rate:                             1062.50 Mhz

  Memory Bus Width:                              128-bit

  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)

  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Concurrent copy and execution:                 No with 0 copy engine(s)

  Run time limit on kernels:                     Yes

  Integrated GPU sharing Host Memory:            Yes

  Support host page-locked memory mapping:       Yes

  Concurrent kernel execution:                   No

  Alignment requirement for Surfaces:            Yes

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

  Device supports Unified Addressing (UVA):      No

  Device PCI Bus ID / PCI location ID:           2 / 0

  Compute Mode:

     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Atomic operations on shared memory are only supported from compute capability 1.2 onwards, your Geforce 9400M is 1.1.

I see…

Next time I should read much carefully the manual :pinch:

Thanks.