Hello everyone!
I’m on a MacBook Pro with 9400M chip and CUDA 4 installed and working.
I’m having troubles running the histogram example from “CUDA by example” book.
After some test I’ve find out that the problem occur when the code is modified to perform atomicAdd on shared memory.
Here is the issued kernel code:
__global__ void histo_kernel( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
__shared__ unsigned int temp[256];
temp[threadIdx.x] = 0;
__syncthreads();
while (i < size) {
atomicAdd( &(temp[buffer[i]]), 1 );
i += stride;
}
__syncthreads();
atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}
Everithing works fine if I use the atomicAdd on the global memory like this:
__global__ void histo_kernel( unsigned char *buffer,
long size,
unsigned int *histo ) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;
while (i < size) {
atomicAdd( &(histo[buffer[i]]), 1 );
i += stride;
}
}
But the shared memory version returns an empty array External Image
There is one more thing, and is quite scary.
I’ve tried to find out what the problem was by compiling with -g -G flags and running cuda-gdb,
resulting in a complete system freeze when entering the kernel scope External Image
Any ideas?
Thanks in advance!
These are my GPU features:
Device 0: "GeForce 9400M"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 254 MBytes (265945088 bytes)
( 2) Multiprocessors x ( 8) CUDA Cores/MP: 16 CUDA Cores
GPU Clock Speed: 1.10 GHz
Memory Clock rate: 1062.50 Mhz
Memory Bus Width: 128-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >