Problem with my histogram

This is the kernel for my histogram. In emuDebug mode it works fine, but in release mode it does not. In this code, a is the input matrix and b is the output vector; *e is the value of each cell of the matrix. I think there is a synchronization problem.

__global__ void test(int *a, int *b, size_t pitch){

int iy = blockDim.y * blockIdx.y + threadIdx.y; // row
int ix = blockDim.x * blockIdx.x + threadIdx.x; // column

if (ix >= C || iy >= L)
    return; // this thread falls outside the matrix

int *e = (int*)((char*)a + iy * pitch) + ix; // address of cell (iy, ix) in the pitched matrix
b[*e] += 1;
}

int bx = (C + BLOCK_SIZE-1) / BLOCK_SIZE;
int by = (L + BLOCK_SIZE-1) / BLOCK_SIZE;
dim3 blocks(bx, by);
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
test<<< blocks, threads >>>(a, b, pitch);

Thanks in advance.

Sidney Lima

You’ve got a race condition: multiple threads read and write the same cells of the b array simultaneously, and since b[*e] += 1 is a non-atomic read-modify-write, some increments are lost.
Since emulation mode is single-threaded, you don’t see these problems there.

You could solve this with global atomics, which is simple but not the most efficient.
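
As a minimal sketch of the global-atomics approach (not the poster's final code): the kernel below keeps the indexing from the question but replaces the plain increment with atomicAdd. The names histAtomic and histogramOnGpu, the cols/rows parameters (standing in for the C and L constants), and the values of BLOCK_SIZE and NUM_BINS are assumptions made for illustration. Error checking is omitted for brevity, and atomicAdd on global memory needs a device of compute capability 1.1 or higher.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16   // assumed; the original post does not give its value
#define NUM_BINS   256  // assumed number of histogram bins

// Same indexing as the kernel in the question, but the increment is atomic,
// so concurrent updates to the same bin are serialized instead of lost.
__global__ void histAtomic(const int *a, int *b, size_t pitch, int cols, int rows)
{
    int iy = blockDim.y * blockIdx.y + threadIdx.y; // row
    int ix = blockDim.x * blockIdx.x + threadIdx.x; // column
    if (ix >= cols || iy >= rows)
        return; // outside the matrix

    const int *e = (const int *)((const char *)a + iy * pitch) + ix;
    int v = *e;
    if (v >= 0 && v < NUM_BINS)   // guard against out-of-range cell values
        atomicAdd(&b[v], 1);
}

// Hypothetical host helper: copies a row-major matrix to pitched device
// memory, runs the kernel, and returns the histogram.
std::vector<int> histogramOnGpu(const std::vector<int> &m, int cols, int rows)
{
    int *dA = nullptr, *dB = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void **)&dA, &pitch, cols * sizeof(int), rows);
    cudaMemcpy2D(dA, pitch, m.data(), cols * sizeof(int),
                 cols * sizeof(int), rows, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&dB, NUM_BINS * sizeof(int));
    cudaMemset(dB, 0, NUM_BINS * sizeof(int)); // bins must start at zero

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocks((cols + BLOCK_SIZE - 1) / BLOCK_SIZE,
                (rows + BLOCK_SIZE - 1) / BLOCK_SIZE);
    histAtomic<<<blocks, threads>>>(dA, dB, pitch, cols, rows);

    std::vector<int> out(NUM_BINS);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(out.data(), dB, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return out;
}
```

Every thread that hits a popular bin contends for the same global address, which is why this version is simple but can be slow when the data is skewed.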

There are histogram examples in the CUDA SDK, which would be the best place to study more efficient approaches.
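
For orientation before reading those examples, here is a rough sketch of one common way to cut global-memory contention: each block accumulates a private histogram in shared memory and merges it into b once at the end. This is not the SDK's code; histShared, histogramOnGpuShared, the cols/rows parameters, and the BLOCK_SIZE and NUM_BINS values are assumptions for illustration. Shared-memory atomics need compute capability 1.2 or higher, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16   // assumed; the original post does not give its value
#define NUM_BINS   256  // assumed number of histogram bins

__global__ void histShared(const int *a, int *b, size_t pitch, int cols, int rows)
{
    __shared__ int local[NUM_BINS];               // per-block private histogram
    int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;

    // Cooperatively zero the private histogram.
    for (int i = tid; i < NUM_BINS; i += nthreads)
        local[i] = 0;
    __syncthreads();

    int iy = blockDim.y * blockIdx.y + threadIdx.y; // row
    int ix = blockDim.x * blockIdx.x + threadIdx.x; // column
    // No early return here: every thread must reach the barriers.
    if (ix < cols && iy < rows) {
        const int *e = (const int *)((const char *)a + iy * pitch) + ix;
        int v = *e;
        if (v >= 0 && v < NUM_BINS)
            atomicAdd(&local[v], 1);              // contention stays inside the block
    }
    __syncthreads();

    // Merge: at most one global atomic per non-empty bin per block.
    for (int i = tid; i < NUM_BINS; i += nthreads)
        if (local[i] != 0)
            atomicAdd(&b[i], local[i]);
}

// Hypothetical host helper, mirroring the launch configuration in the question.
std::vector<int> histogramOnGpuShared(const std::vector<int> &m, int cols, int rows)
{
    int *dA = nullptr, *dB = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void **)&dA, &pitch, cols * sizeof(int), rows);
    cudaMemcpy2D(dA, pitch, m.data(), cols * sizeof(int),
                 cols * sizeof(int), rows, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&dB, NUM_BINS * sizeof(int));
    cudaMemset(dB, 0, NUM_BINS * sizeof(int));

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blocks((cols + BLOCK_SIZE - 1) / BLOCK_SIZE,
                (rows + BLOCK_SIZE - 1) / BLOCK_SIZE);
    histShared<<<blocks, threads>>>(dA, dB, pitch, cols, rows);

    std::vector<int> out(NUM_BINS);
    cudaMemcpy(out.data(), dB, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return out;
}
```

Whether this beats plain global atomics depends on the hardware and on how the cell values are distributed, so it is worth timing both.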