Measuring conflicts when performing non-atomic writes to global memory

I asked a similar question a month ago about global memory writes and race detection: http://forums.nvidia.com/index.php?showtopic=95358 . However, there wasn’t a clear explanation for why my accumulation operator succeeded when, based on the replies, it was supposed to fail. Now I need to find out exactly what happened.

Again, my kernel looks like this:

__global__ void mykernel(float *data){
  for(int i=0;i<max_step;i++) {
	 // compute a value first
	 // a lot of math
	 float myvalue=...;
	 float3 pos=...;
	 // then add this value to a grid
	 int idx=get_1d_index(pos.x,pos.y,pos.z);
	 // non-atomic read-modify-write: this is where threads may race
	 data[idx]+=myvalue;
   }
}

The code to compute myvalue is quite complicated: it involves a lot of math and random numbers and is about 200 lines long (with a number of if-blocks). The integer “idx” is a 1D index into a 3D global array (data), computed from the current position pos (a float3). Because this is a random walk, different threads are very likely writing to different parts of data.
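To make the indexing concrete, here is a minimal sketch of what a function like get_1d_index might do. The name get_1d_index_sketch, the grid dimensions nx, ny, nz, the x-fastest layout and the -1 return for out-of-grid positions are all assumptions for illustration; my real function only takes the three coordinates.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical helper (not my actual get_1d_index): maps a floating-point
// position to a linear voxel index, assuming unit-sized voxels and an
// x-fastest layout. Returns -1 if the position falls outside the grid.
__host__ __device__ inline int get_1d_index_sketch(float x, float y, float z,
                                                   int nx, int ny, int nz)
{
    int ix = (int)floorf(x);
    int iy = (int)floorf(y);
    int iz = (int)floorf(z);
    if (ix < 0 || ix >= nx || iy < 0 || iy >= ny || iz < 0 || iz >= nz)
        return -1;
    return ix + nx * (iy + ny * iz);
}
```

With a layout like this, two threads conflict only when their positions fall into the same voxel at the moment they both execute the += on data.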

I used a thread block of dimension 128x1, and 8 to 15 blocks in total.

Now I want to find out how many of my threads happened to write to the same address at the same time (i.e. raced or conflicted) during the execution of my program.

Because the output of my program is indeed very close to the anticipated solution, I believe that racing, if it happened at all, happened with a very low probability. However, I do need to find out what this probability was.

By running the CUDA Visual Profiler, I noticed the following fields:
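One way I could imagine measuring this, as a rough sketch only: count hits per voxel twice, once with the same unprotected += and once with atomicAdd on an unsigned int (which, as far as I know, needs compute capability 1.1 or higher), and compare the two totals on the host. Everything below is hypothetical test code, not my real kernel: next_voxel is a cheap pseudo-random stand-in for the random walk, and count_hits and count_lost_updates are names I made up.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the random walk: a simple LCG picks a voxel.
__device__ unsigned int next_voxel(unsigned int *state, unsigned int n_voxels)
{
    *state = *state * 1664525u + 1013904223u;
    return (*state >> 8) % n_voxels;
}

__global__ void count_hits(unsigned int *racy, unsigned int *exact,
                           unsigned int n_voxels, int max_step)
{
    unsigned int state = blockIdx.x * blockDim.x + threadIdx.x + 1u;
    for (int i = 0; i < max_step; i++) {
        unsigned int idx = next_voxel(&state, n_voxels);
        racy[idx] += 1u;             // unprotected read-modify-write, like data[idx]+=myvalue
        atomicAdd(&exact[idx], 1u);  // atomic increment, never loses an update
    }
}

// Returns how many increments the unprotected += lost, or -1 on a CUDA error.
long long count_lost_updates(unsigned int n_voxels, int blocks, int threads,
                             int max_step)
{
    size_t bytes = n_voxels * sizeof(unsigned int);
    unsigned int *d_racy = NULL, *d_exact = NULL;
    if (cudaMalloc((void **)&d_racy, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_exact, bytes) != cudaSuccess) {
        cudaFree(d_racy);
        return -1;
    }
    cudaMemset(d_racy, 0, bytes);
    cudaMemset(d_exact, 0, bytes);

    count_hits<<<blocks, threads>>>(d_racy, d_exact, n_voxels, max_step);

    // cudaMemcpy waits for the kernel and reports any launch or run error.
    std::vector<unsigned int> racy(n_voxels), exact(n_voxels);
    cudaError_t e1 = cudaMemcpy(&racy[0], d_racy, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(&exact[0], d_exact, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_racy);
    cudaFree(d_exact);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    long long lost = 0;
    for (unsigned int v = 0; v < n_voxels; v++)
        lost += (long long)exact[v] - (long long)racy[v];
    return lost;
}
```

For example, count_lost_updates(64 * 64 * 64, 8, 128, 1000) on an assumed 64x64x64 grid would give the number of dropped increments out of 8 * 128 * 1000 attempts. The extra atomicAdd changes the timing of the loop, so I would treat the resulting rate as an estimate rather than the exact figure for my real kernel.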

· gst uncoalesced : Number of non-coalesced global memory stores (844 for my sample run)

· gst coalesced : Number of coalesced global memory stores (250016530 for my sample run)

When conflicts happen between different threads, will they be counted as “gst coalesced”? Is there any other metric that I can use to estimate the number of conflicts?

Also, if a conflict happens, will my result be a random number that has nothing to do with the values assigned by the racing threads, or will it be one of those values?

In other words, if the 844 non-coalesced writes are all the conflicting writes I have, will my results be off by at most 844 events, or will they be completely useless?

Thank you!