Missing writes to global mem

Hi,

I get some missing writes for this simple accumulator kernel:

extern __shared__ float s_c_accumulators[];

__global__ void d_accumulate(float *d_c_accumulators, int nClasses)
{
  float *c_n = s_c_accumulators;
  float *c_d = &s_c_accumulators[nClasses];

  // Load c_n and c_d into shared memory.
  int threadID = threadIdx.x + (threadIdx.y + threadIdx.z*blockDim.y)*blockDim.x;
  if (threadID < 2*nClasses)
    s_c_accumulators[threadID] = d_c_accumulators[threadID];
  __syncthreads();

  for (pitchz = 0; pitchz < nClasses; pitchz++)
  {
    c_n[pitchz] = c_n[pitchz] + 1.0;
    c_d[pitchz] = c_d[pitchz] + 1.0;
  }

  // Write out results...
  __syncthreads();
  if (threadID < 2*nClasses)
    d_c_accumulators[threadID] = s_c_accumulators[threadID];
}

d_c_accumulators is device memory and is initialized to 0.0 before the kernel call. I get these values in d_c_accumulators:

     1347.000000	     1270.000000
     1511.000000	     1427.000000
     1587.000000	     1510.000000
     1659.000000	     1588.000000
     1678.000000	     1631.000000
     1719.000000	     1666.000000
     1737.000000	     1677.000000
     1733.000000	     1682.000000

These values are obviously completely wrong. My launch config is:

d_accumulate<<<32*32*32, dim3(4, 4, 4), 2*8*sizeof(float)>>>( d_c_accumulators, 8);
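For reference, the host side is roughly this (a sketch of my setup; everything except d_c_accumulators and d_accumulate is paraphrased):

  float *d_c_accumulators;
  cudaMalloc((void **)&d_c_accumulators, 2 * 8 * sizeof(float));
  cudaMemset(d_c_accumulators, 0, 2 * 8 * sizeof(float));  // all-zero bits == 0.0f

  d_accumulate<<<32*32*32, dim3(4, 4, 4), 2*8*sizeof(float)>>>(d_c_accumulators, 8);

  float h_c_accumulators[2 * 8];
  cudaMemcpy(h_c_accumulators, d_c_accumulators, 2 * 8 * sizeof(float), cudaMemcpyDeviceToHost);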

Why does this happen? I would expect all the values in the accumulator array to be the same. Does it have to do with write collisions? If so, why would some writes fail?

Thanks,

Oj

Please read:

  for (pitchz = 0; pitchz < nClasses; pitchz++)
  {
    c_n[pitchz] = c_n[pitchz] + 1.0;
    c_d[pitchz] = c_d[pitchz] + 1.0;
  }

as

  for (int i = 0; i < nClasses; i++)
  {
    c_n[i] = c_n[i] + 1.0;
    c_d[i] = c_d[i] + 1.0;
  }

It looks like you have two problems. One is that all threads in a block will attempt to execute this simultaneously, leading to a race condition:
c_n[i] = c_n[i] + 1.0;
If each thread incremented it individually (which might happen in emulation mode), it would be incremented by the number of threads in the block. But if all the threads in a warp load the same value of c_n[i], increment it, and then store the new value, it only gets incremented once per warp. And if one warp loads the old value in between another warp's load and store, the effective increment is smaller still. The final value depends entirely on how the threads happen to be scheduled.
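One way to make the per-block updates race-free, assuming your device supports atomicAdd on floats in shared memory (so this is only a sketch of the idea, not a drop-in fix), is to let each thread do its increments atomically:

  for (int i = 0; i < nClasses; i++)
  {
    // atomicAdd makes the read-modify-write indivisible, so no
    // thread's increment gets overwritten by another thread's.
    atomicAdd(&c_n[i], 1.0f);
    atomicAdd(&c_d[i], 1.0f);
  }

That prevents the lost updates within a block, but the conflicting threads then take turns on the same few addresses, so it is not free.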

The other problem is basically the same thing at the block level. Two or more blocks can load the values, update them separately, and then write their results back, each overwriting the other's updates. It's also possible for one block to be reading the values while another block is writing them, which is why the entries can even differ from each other.
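The inter-block race can be handled the same way by dropping the shared-memory staging copy and accumulating straight into global memory. A minimal sketch, again assuming float atomicAdd is available on your device and that adding 1.0 per thread to every accumulator really is the intent:

__global__ void d_accumulate_atomic(float *d_c_accumulators, int nClasses)
{
  // Every thread bumps each of the 2*nClasses accumulators once;
  // the atomics make the read-modify-write safe across all blocks.
  for (int i = 0; i < 2 * nClasses; i++)
    atomicAdd(&d_c_accumulators[i], 1.0f);
}

With your launch configuration each entry should then come out to 32*32*32 blocks times 4*4*4 threads, i.e. 2097152.0, though that many atomics contending on the same handful of addresses will be slow.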

That's so true! Thanks for pointing that out.

Then is it really possible to make an accumulator like this work (efficiently, without serializing)?

-Oj