Kernel performance w.r.t. array length

I have a GPU kernel as follows:

    void doSomething(RawDataArray, DecisionArray, Result)
    {
        if (threadIndex < RawDataArray.length())
        {
            int index = process(RawDataArray);
            if (DecisionArray[index] > threshold)
            {
                atomicAdd(Result, 1);
            }
        }
    }

Basically, every thread computes an index into the DecisionArray. Then it reads the value at that particular location of the DecisionArray and decides whether or not to increment Result.

I was wondering what the time complexity of this kernel would be. I was thinking the computing time should depend on the length of RawDataArray and on the number of atomicAdd operations, and that it should be independent of the length of DecisionArray, because every thread accesses just one location of this array, once. However, after implementing it, I found that this kernel's runtime depends heavily on the length of DecisionArray, and I can't figure out why. An increase in the length of DecisionArray results in a prolonged runtime. What am I missing here? I would be happy if you could help me with this.

to answer your question, we need a reproducible testcase, including your method of measuring performance and your cuda/gpu version. one possibility is that you are including the time of copying DecisionArray from cpu to gpu, especially if you are using managed memory

I am using a GeForce GTX 970. This kernel is part of a collision detection algorithm that I have implemented. Performance is measured by counting OpenGL rendering frames and also by using a std::chrono high_resolution_clock before and after the collision detection.

The whole collision detection has two parts. The first part is executed during initialization. At this stage, I call cudaMalloc for the three kernel arguments and make two cudaMemcpy calls to copy the DecisionArray and Result to the GPU.
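
In rough code, the initialization stage is something like this (a sketch with hypothetical names and sizes, error checking omitted):

    // one-time setup: allocate the three device buffers ...
    int *dRawData, *dDecision, *dResult;
    cudaMalloc(&dRawData,  rawDataLength  * sizeof(int));
    cudaMalloc(&dDecision, decisionLength * sizeof(int));
    cudaMalloc(&dResult,   sizeof(int));

    // ... and upload DecisionArray and Result once; they stay on the GPU afterwards
    cudaMemcpy(dDecision, hostDecision, decisionLength * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, hostResult, sizeof(int), cudaMemcpyHostToDevice);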

In every rendering frame, I call cudaMemcpy to copy the updated RawDataArray to the GPU, then I launch the kernel. The time I measure corresponds to this stage; the initialization is not included. So I think the CPU-to-GPU copy time for DecisionArray should not have any effect on the measured time.
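
The timed part looks roughly like this (again a sketch with placeholder names; I synchronize before stopping the clock, since kernel launches are asynchronous):

    // per-frame: upload the new RawDataArray and run the kernel, timing both
    auto t0 = std::chrono::high_resolution_clock::now();

    cudaMemcpy(dRawData, hostRawData, rawDataLength * sizeof(int),
               cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int numBlocks = (rawDataLength + threadsPerBlock - 1) / threadsPerBlock;
    doSomething<<<numBlocks, threadsPerBlock>>>(dRawData, rawDataLength,
                                                dDecision, dResult);
    cudaDeviceSynchronize();   // wait for the kernel before reading the clock

    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();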

Here is a slightly more informative version of my kernel:

    __global__ void doSomething(int *RawDataArray, int rawDataLength,
                                int *DecisionArray, int *Result)
    {
        int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
        // a raw pointer carries no length, so the length is passed in explicitly
        if (threadIndex < rawDataLength)
        {
            int index = process(RawDataArray);   // process() and threshold are defined elsewhere
            if (DecisionArray[index] > threshold)
            {
                atomicAdd(Result, 1);
            }
        }
    }

My intention was to keep the DecisionArray in global memory. Since access to an array element is O(1) and every thread accesses the DecisionArray only once, all threads should take roughly the same time to execute. Therefore, for a fixed number of threads, the length of DecisionArray should not affect the runtime. Am I right?

For a longer DecisionArray, the kernel's performance is very poor. What am I missing?

there is no reproducible case. we don't know what else changes as DecisionArray grows. my next guess is that increased contention on the atomic variable is the real issue
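
if the atomic really is the bottleneck, the usual mitigation is to count within each block in shared memory and issue only one global atomicAdd per block. a rough sketch along the lines of your kernel (process() and threshold as in your code):

    __global__ void doSomethingAggregated(int *RawDataArray, int rawDataLength,
                                          int *DecisionArray, int *Result)
    {
        __shared__ int blockCount;                  // per-block partial count
        if (threadIdx.x == 0) blockCount = 0;
        __syncthreads();

        int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIndex < rawDataLength)
        {
            int index = process(RawDataArray);      // as in your code
            if (DecisionArray[index] > threshold)
                atomicAdd(&blockCount, 1);          // cheap shared-memory atomic
        }
        __syncthreads();

        if (threadIdx.x == 0 && blockCount > 0)
            atomicAdd(Result, blockCount);          // one global atomic per block
    }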

i understand that it's hard to make a small reproducible case out of a larger program. but look at it from the other side - all we know is that some array in your program became larger, and then you ask why it's slower, without even telling us how many calls are made into the kernel. if you can't extract a reproducible case from the program, it's up to you to provide just enough info to answer your question

if you believe that the kernel you cited really reproduces the problem - try to build a testcase from it and check whether it really reproduces the problem. if it does - you can give us the testcase. if it doesn't - you will see that you haven't given us enough info to help you, and can look further
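
for example, a skeleton like this would already be something we can build and run. the sizes are made up and process() is replaced by a pseudo-random scatter stub, so treat it only as the shape of a testcase, not your algorithm:

    #include <cstdio>
    #include <chrono>

    __global__ void doSomething(int *RawDataArray, int rawDataLength,
                                int *DecisionArray, int decisionLength, int *Result)
    {
        int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIndex < rawDataLength)
        {
            // stub for process(): pseudo-random scatter over DecisionArray
            int r = RawDataArray[threadIndex];
            int index = ((unsigned)(r + threadIndex) * 2654435761u) % decisionLength;
            if (DecisionArray[index] > 0)
                atomicAdd(Result, 1);
        }
    }

    int main()
    {
        const int rawDataLength  = 1 << 20;   // made-up sizes -
        const int decisionLength = 1 << 21;   // vary this one and re-measure

        int *dRaw, *dDecision, *dResult;
        cudaMalloc(&dRaw, rawDataLength * sizeof(int));
        cudaMalloc(&dDecision, decisionLength * sizeof(int));
        cudaMalloc(&dResult, sizeof(int));
        cudaMemset(dRaw, 1, rawDataLength * sizeof(int));
        cudaMemset(dDecision, 1, decisionLength * sizeof(int));
        cudaMemset(dResult, 0, sizeof(int));

        auto t0 = std::chrono::high_resolution_clock::now();
        doSomething<<<(rawDataLength + 255) / 256, 256>>>(dRaw, rawDataLength,
                                                          dDecision, decisionLength, dResult);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::high_resolution_clock::now();

        printf("decisionLength=%d: %.3f ms\n", decisionLength,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        return 0;
    }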

I appreciate the time you spent answering the question. I am not clear on what you mean by a reproducible case. What kind of information do you want me to add? Do you want host and kernel code so that you can run it on your computer?

yes, unless some more competent programmer comes along and answers your question from the info you have provided

in general, it's not enough to say "it works slower" - you need to give us code that can really be built and run, and then ask questions about that code. it's the usual practice for programming questions, but for CUDA it tends to be especially important due to its complexities

How large is DecisionArray?
Smaller arrays may fit entirely into cache, while caching efficiency will fall off rapidly once the cache size is exceeded.

There are a few cases where the access time to DecisionArray may depend on its size. First, your GPU has a ~2 MB L2 cache. Data may stay cached between kernel calls.

Second, adjacent data may be cached together because GPU caches have a 32/128-byte line size. I.e., if for a small DecisionArray the gaps between accessed elements are less than 128 bytes, this may lead to fewer memory transactions.

And third, for really large arrays, say >1 GB, random access becomes much slower, probably due to TLB pressure.
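
you can query the actual cache size of your card with cudaGetDeviceProperties and compare it with the array's footprint:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // compare against decisionLength * sizeof(int)
    printf("L2 cache size: %d bytes\n", prop.l2CacheSize);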

Thanks for the points you mentioned.
I have printed out the index every thread computes and realized that, for example, the element DecisionArray[130095] was accessed by these threads:

3415
3414
3413
3538
3537
3477
2938
2996
3056
3055

The DecisionArray holds integers and has a length of 2097152, which is around 8 MB. It looks like I have random accesses to locations in DecisionArray, each hit multiple times, and I suspect this causes the slowdown. How can I accelerate this? Can I put 8 MB into shared memory? Is there a way to cache it?

when the same data are accessed by multiple threads in the same warp, it doesn't incur any extra cost - the value is broadcast to all of them. you can find more info in the CUDA Manual

i think that now you need to build a model of the accesses to this array in order to find out how to optimize them. and no, shared memory isn't large enough to hold 8 MB, and anyway it doesn't persist between thread blocks. but there may be other issues, such as non-coalesced memory access, which may be solved by its use

Please correct me wherever I am wrong.

Given the above discussion, I think I do have scattered accesses to memory. If I can break the array down into smaller pieces and pass them to local memory, I think we can solve the problem. That being said, I want to copy smaller, distinct chunks of the DecisionArray into shared memory. Then I have to make sure the threads within the corresponding block are chosen such that they will not access any chunk of the array belonging to other blocks.
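
Roughly, what I have in mind is the sketch below (hypothetical names; it assumes every index a block computes falls inside the one CHUNK-sized piece it loaded, which is exactly the property I would have to guarantee):

    #define CHUNK 4096   // elements per block; 4096 ints = 16 KB of shared memory

    __global__ void doSomethingTiled(int *RawDataArray, int rawDataLength,
                                     int *DecisionArray, int *Result)
    {
        __shared__ int chunk[CHUNK];

        // cooperatively stage this block's piece of DecisionArray into shared memory
        int chunkBase = blockIdx.x * CHUNK;
        for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
            chunk[i] = DecisionArray[chunkBase + i];
        __syncthreads();

        int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIndex < rawDataLength)
        {
            int index = process(RawDataArray);   // must fall in [chunkBase, chunkBase + CHUNK)
            if (chunk[index - chunkBase] > threshold)
                atomicAdd(Result, 1);
        }
    }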

Do you have an example of such an algorithm in mind? Something I can look at to get an idea of how to do it?

it seems so. but it isn't a problem by itself - maybe you can improve the code in some way

note that "local memory" has a specific meaning in the CUDA glossary; it seems that what you mean here is just shared memory

i think you should proceed the other way - first, learn the access patterns of global and shared memory from the link i just cited. second, build a model of the memory accesses in your code. and then analyze how it works against the GPU hardware and look for ways to improve it

An alternative is to publish this part of your code and ask for help here. Of course, no guarantees, but there are very skilled people here

The CUDA manual may not be the best place to learn from - there are also books and videos. Overall, i think your idea of copying some code example doesn't look feasible; instead you should learn how GPUs work and apply this knowledge to your code