Inaccuracy with Nested (2) For Loops
Is there an issue with nested for loops?

Hey folks,

I was working on something which requires loading multiple batches of data into shared memory for further processing. I have reduced my problem to a snippet of code and would be very thankful if someone could share their thoughts.

Assume the following code runs for, say, just 1 block:

When I read back the variable tracknum, I consistently get 5000 at both the first and second indices of tracknum.

In contrast, if I use the following code:

Assume MAX_DATA is a size that fits in shared memory.

The output for tracknum[0] stays at 5000,

but tracknum[1] reports 2000.

I am using CUDA 1.1, just in case it matters.

Thank you for any information you could share



I don't have a CUDA 1.1 card and am not familiar with the atomic functions.

Is it to do with the __syncthreads()? You call atomicAdd a different number of times in different threads. I'm eager to know how the hardware ensures the atomicAdd calls complete relative to the subsequent __syncthreads().

Please be clearer about what values you expect, because we cannot say anything about the output if we don't know the input. I suppose numData * numpass is 2000?

I think if you look at the code you will realize that the output should be 5000 in the second case too. numData takes the value of MAX_DATA until the data left is less than MAX_DATA; at that point, numData is whatever is left.

Thus, for a total of 5000:

numData would take the value 256 for 19 passes (taking MAX_DATA = 256),

and the value 136 for the last pass.

Thus I should get 256 * 19 + 136 = 5000 as the output, even for tracknum[1].

Let me know if something is not clear

Thanks for the reply.

I am sorry for a typo in the above posting: an instance of tricount should be read as dataCount.