I am working on something that requires loading multiple batches of data into shared memory and processing them further. I have reduced my problem to a snippet of code and would be very
thankful if someone could share their thoughts.
Suppose the following code runs with just one block:
When I read back the variable “tracknum”, I do get 5000 at both the first and the second index.
On the other hand, if I use the following code:
Assume MAX_DATA is small enough that the data fits in shared memory.
I don’t have a compute capability 1.1 card and am not familiar with the atomic functions.
Does it have to do with the __syncthreads()? The threads call atomicAdd a different number of times. I’m eager to know how the hardware handles this atomicAdd and the subsequent __syncthreads().
Please be more clear on what values you expect, because we cannot say anything about the output if we don’t know the input. I suppose numData*numpass is 2000?
I think if you look at the code you will see that the output should be 5000 in the second case too. numData takes the value of MAX_DATA until the data left is less than MAX_DATA; in that case numData is whatever is left.
Thus, for a total of 5000:
numData would take the value 256 for 19 passes (taking MAX_DATA = 256)
and the value 136 for the last pass.
So I should get 256 * 19 + 136 = 5000 as the output for trackNum[1] as well.
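The arithmetic above can be checked on the host with a small sketch. This is not the kernel from the original post: the helper names `pass_sizes` and `expected_count` are hypothetical, and the code only assumes what the post states, namely that each pass handles at most MAX_DATA items and the last pass handles whatever is left.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: size of each pass when `total` items are processed
// in batches of at most `max_data` items (the role MAX_DATA plays above).
std::vector<int> pass_sizes(int total, int max_data) {
    assert(max_data > 0);  // a zero batch size would never make progress
    std::vector<int> sizes;
    int remaining = total;
    while (remaining > 0) {
        // numData is MAX_DATA until fewer than MAX_DATA items are left.
        int num_data = remaining < max_data ? remaining : max_data;
        sizes.push_back(num_data);
        remaining -= num_data;
    }
    return sizes;
}

// Sum of all pass sizes: the value a counter incremented once per item
// should reach after the final pass.
int expected_count(int total, int max_data) {
    int sum = 0;
    for (int n : pass_sizes(total, max_data)) {
        sum += n;
    }
    return sum;
}
```

For 5000 items and MAX_DATA = 256 this gives 20 passes (19 of size 256 and one of size 136), so a count that comes back different from 5000 points at the kernel's synchronisation or atomics rather than at the batching arithmetic.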