I have a GPU kernel along these lines (CUDA-style):

```
__global__ void doSomething(const float *RawDataArray, int rawDataLength,
                            const float *DecisionArray, float threshold,
                            int *Result)
{
    int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIndex < rawDataLength)
    {
        // each thread maps its raw element to an index into DecisionArray
        int index = process(RawDataArray[threadIndex]);
        if (DecisionArray[index] > threshold)
        {
            atomicAdd(Result, 1);
        }
    }
}
```

Basically, every thread computes an index into DecisionArray, reads the value at that one location, and decides whether or not to increment Result.

I was wondering what the time complexity of this kernel is. I expected the runtime to depend on the length of RawDataArray and on the number of atomicAdd operations, and to be independent of the length of DecisionArray, since every thread accesses just one location of that array, once. However, after implementing it, I found that the kernel's runtime depends heavily on the length of DecisionArray, and I can't figure out why: increasing the length of DecisionArray prolongs the runtime. What am I missing here? Any help would be appreciated.