I have used GLSL before and am now trying to migrate my calculations to CUDA, but there is one computation I cannot seem to map to CUDA.
The problem is quite simple: I calculate a large number of coordinates (roughly a million) and want to gather them into discrete buckets.
In GLSL I simply created one vertex per coordinate, let the vertex shader position it according to a lookup in
the texture holding the calculated coordinates, and performed an add operation in the fragment shader.
I know that random access to global memory is slow, and that is where my CUDA kernel loses all its time.
But I am completely at a loss as to how to solve this. Rendering the coordinates into shared memory is not possible, because there is
simply not enough shared memory available. The calculated coordinates are not predictable in any way, so grouping them is out of the
question. And sorting a million elements does not seem like the best idea either.
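To make the operation concrete, here is a minimal sketch of the kind of scatter-add kernel I mean. All names are placeholders, and I am assuming coordinates in [0,1) that quantize to a flat bucket grid; my real code is more involved, but the memory access pattern is the same:

```cuda
// Sketch only: one thread per coordinate, each incrementing its target
// bucket in global memory. Because the bucket index is unpredictable,
// the atomicAdd writes are scattered and uncoalesced.
__global__ void scatterAdd(const float2 *coords, int *buckets,
                           int numCoords, int bucketsPerDim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numCoords) return;

    float2 c = coords[i];                       // a calculated coordinate
    int bx = min((int)(c.x * bucketsPerDim), bucketsPerDim - 1);
    int by = min((int)(c.y * bucketsPerDim), bucketsPerDim - 1);

    // random-access write to global memory -- this is the slow part
    atomicAdd(&buckets[by * bucketsPerDim + bx], 1);
}
```

This is essentially a 2D histogram, which is why I hoped there would be a known pattern for it on the CUDA side.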
So the final question is: is there a way to map the described GLSL approach to the CUDA architecture?
How does GLSL manage to serialize those memory writes so quickly, and what do I have to do in CUDA to get comparable performance?
At the moment the CUDA implementation is about 20x slower than the GLSL one.
I am grateful for any tips or hints toward sources where I can find the answer.