I want to implement a function similar to np.add.at() in NumPy, which accumulates data based on another index array. However, due to the randomness of the indices, memory access efficiency is very low. Additionally, I need to perform this operation on large datasets, so using shared memory doesn’t seem reliable.
I don’t have first-hand experience in this area, but it seems the first places to look for this functionality are PyCUDA and CuPy. If you cannot find anything relevant there, maybe look at Thrust.
Low memory-access efficiency with random indices is a fundamental issue. You may be able to mitigate it depending on the specifics of the use case, but it is here to stay. The old joke applies: “Doctor, it hurts when I push here.” “Don’t push!”
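One mitigation that may apply, offered as a sketch under assumptions rather than a definitive recipe: sort the updates by target index first, so that duplicates become adjacent and each output element is written only once, in ascending order. On the GPU these two steps would roughly correspond to thrust::sort_by_key followed by thrust::reduce_by_key; whether the sort pays for itself depends on the data. The version below is plain sequential C++ for illustration only. The name scatter_add_sorted is made up, and all indices are assumed to be valid positions in out.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: same result as np.add.at(out, idx, val), restructured
// so that writes to `out` happen in ascending index order, one write per
// distinct index. Assumes every idx[i] is a valid position in `out`.
void scatter_add_sorted(std::vector<float>& out,
                        const std::vector<int>& idx,
                        const std::vector<float>& val) {
    // Build a permutation that orders the updates by target index.
    std::vector<std::size_t> order(idx.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return idx[a] < idx[b]; });

    // Walk runs of equal indices, summing each run before a single write.
    std::size_t i = 0;
    while (i < order.size()) {
        const int target = idx[order[i]];
        float sum = 0.0f;
        while (i < order.size() && idx[order[i]] == target) {
            sum += val[order[i]];
            ++i;
        }
        out[target] += sum;
    }
}
```

Note that the reads from val are still in permuted order here; sorting the values together with the keys, as a key-value sort does, would make those reads contiguous as well.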
Thank you for your suggestion. I found a function called thrust::scatter to implement similar functionality. Following this clue, I used the keywords ‘CUDA scatter and gather’ and found some related papers.
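One caveat worth checking, as far as I understand the Thrust documentation: thrust::scatter only copies each input element to its mapped position, and its result is undefined when the same index appears more than once in the map. By itself it therefore does not accumulate the way np.add.at() does. A sequential C++ sketch of the difference follows; both function names are made up for illustration, and indices are assumed to be in range.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: plain scatter, out[map[i]] = in[i].
// In this sequential version the last write to a duplicated index wins;
// a parallel scatter gives no such guarantee.
std::vector<float> scatter_overwrite(const std::vector<float>& in,
                                     const std::vector<int>& map,
                                     std::size_t out_size) {
    std::vector<float> out(out_size, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[map[i]] = in[i];
    }
    return out;
}

// Hypothetical helper: scatter-add, out[map[i]] += in[i].
// Duplicated indices accumulate, which is the np.add.at() behaviour.
std::vector<float> scatter_accumulate(const std::vector<float>& in,
                                      const std::vector<int>& map,
                                      std::size_t out_size) {
    std::vector<float> out(out_size, 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[map[i]] += in[i];
    }
    return out;
}
```

With in = {1, 2, 3} and map = {0, 2, 0}, the first function leaves only one of the two values sent to position 0, while the second sums them.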