Mapping random memory writes from GLSL to CUDA: CUDA implementation 20 times slower than the GLSL one


I used GLSL before and am now trying to migrate my calculations to CUDA, but I have a computation that I cannot seem to map to CUDA.
The problem is quite simple. I calculate lots of coordinates (roughly a million) and want to gather those in discrete buckets.
In GLSL I simply created one vertex for each coordinate, let the vertex shader position it according to a lookup in
the texture holding the calculated coordinates, and performed an add operation in the fragment shader.

I know that random access to global memory is slow, so that is the part where my CUDA kernel loses all its time.
But I am completely at a loss as to how to solve this. It is not possible to render the coordinates into shared memory, because there is
simply not enough shared memory available. The coordinates that are calculated are not predictable in any way, so grouping is out of the
question. And sorting 1M elements is, well, not the best idea, I think.

So the final question is: is there a way to map the described GLSL approach to the CUDA architecture?
How does GLSL manage to serialize those memory writes so fast, and what do I have to do in CUDA to get comparable performance?
At the moment the CUDA implementation is about 20x slower than the GLSL one.

I am grateful for any tips or hints towards sources where I can find the answer.

So, you’re basically computing a histogram of the randomly generated points? Check out the histogram samples in the SDK, or perhaps the Thrust library.
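To make the Thrust suggestion concrete, here is a minimal sketch of a sort-based histogram. It assumes each coordinate has already been flattened to a single bucket index, and the function name sparseHistogram is made up for this example. It sorts the indices and then counts runs of equal indices with thrust::reduce_by_key, so it never scatters into the bucket array at all. Whether it beats the GLSL version depends on the hardware and the data, so treat it as something to measure rather than a guaranteed fix.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/iterator/constant_iterator.h>
#include <vector>

// Hypothetical helper: builds a sparse histogram from flattened bucket indices.
// On return, occupiedBuckets[k] is a bucket index that received counts[k] points.
void sparseHistogram(const std::vector<unsigned int>& bucketIds,
                     std::vector<unsigned int>& occupiedBuckets,
                     std::vector<unsigned int>& counts)
{
    // Copy the indices to the device and sort them so equal indices are adjacent.
    thrust::device_vector<unsigned int> keys(bucketIds.begin(), bucketIds.end());
    thrust::sort(keys.begin(), keys.end());

    // Worst case: every point lands in its own bucket.
    thrust::device_vector<unsigned int> outKeys(keys.size());
    thrust::device_vector<unsigned int> outCounts(keys.size());

    // Sum a constant 1 over each run of equal keys, which yields the run length.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(),
                                      thrust::constant_iterator<unsigned int>(1),
                                      outKeys.begin(), outCounts.begin());
    size_t m = ends.first - outKeys.begin();

    // Copy only the occupied buckets back to the host.
    occupiedBuckets.resize(m);
    counts.resize(m);
    thrust::copy(outKeys.begin(), outKeys.begin() + m, occupiedBuckets.begin());
    thrust::copy(outCounts.begin(), outCounts.begin() + m, counts.begin());
}
```

With a million points and two million buckets most buckets stay empty, so the sparse output (occupied bucket plus count) may be more convenient than a dense array anyway.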

Yep, it’s basically a 2D histogram.

If I understand the histogram SDK samples correctly, they skip over the data in a clever way to circumvent the memory bottleneck. But I still have the problem that I have way too many buckets to fit into shared memory.

Is there no way to get at the method that the OpenGL environment seems to be using?
OpenGL has no problem rasterizing large numbers of vertices to arbitrary positions in a framebuffer. Why can’t I get there with CUDA?

I seem to be missing something.

Well, the GLSL version would have written to the framebuffer through the rasterizer, since only CUDA supports scattered writes. Perhaps you can look into the CUDA/OpenGL interop API and do something with the framebuffers exposed by it to replicate your GLSL code.
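For reference, a scattered write in plain CUDA could look roughly like the sketch below. It assumes integer bucket coordinates and a device that supports atomicAdd on 32-bit integers in global memory; the names scatterAdd and scatterHistogram are made up for this example. The atomic plays the role of additive blending: without it, two threads hitting the same bucket could lose an update. This is the straightforward mapping, not necessarily a fast one.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per coordinate: increment the bucket the coordinate falls into.
__global__ void scatterAdd(const int2* coords, int n,
                           unsigned int* buckets, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int2 c = coords[i];
    // Discard points outside the grid, much like clipping does in OpenGL.
    if (c.x < 0 || c.x >= width || c.y < 0 || c.y >= height) return;
    // Atomic add so that concurrent hits on the same bucket are all counted.
    atomicAdd(&buckets[c.y * width + c.x], 1u);
}

// Hypothetical host helper: returns a dense width*height histogram.
std::vector<unsigned int> scatterHistogram(const std::vector<int2>& coords,
                                           int width, int height)
{
    std::vector<unsigned int> result((size_t)width * height, 0u);
    if (coords.empty() || result.empty()) return result;

    int n = (int)coords.size();
    int2* dCoords = nullptr;
    unsigned int* dBuckets = nullptr;
    cudaMalloc(&dCoords, n * sizeof(int2));
    cudaMalloc(&dBuckets, result.size() * sizeof(unsigned int));
    cudaMemcpy(dCoords, coords.data(), n * sizeof(int2), cudaMemcpyHostToDevice);
    cudaMemset(dBuckets, 0, result.size() * sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scatterAdd<<<blocks, threads>>>(dCoords, n, dBuckets, width, height);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(result.data(), dBuckets, result.size() * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dCoords);
    cudaFree(dBuckets);
    return result;
}
```

Error checking on the CUDA calls is left out to keep the sketch short.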

EDIT: Another way might be to break your buckets into multiple sets, make multiple passes over your data, and use the SDK histogram techniques to work on each ‘set’ of buckets. This way, the programming should be fairly simple, you get the benefit of shared memory, and you can easily increase the number of buckets without having to totally re-write your code every time.

For example, if you have 2048 buckets, and you can only store 256 buckets in shared memory (due to memory needs or whatever), then make 8 passes over your data. Pass 1 will update buckets 0-255, Pass 2 will update buckets 256-511, …, Pass 8 will update buckets 1792-2047.
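A rough sketch of that multi-pass idea follows. It assumes flattened bucket indices and a device with shared-memory atomics, which the SDK histogram samples do not rely on; the names histogramPass and multiPassHistogram and the launch configuration of 64 blocks of 256 threads are arbitrary choices for this example. Each pass keeps a 256-bucket window in shared memory and ignores every point outside that window.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BUCKETS_PER_PASS 256

// One pass: count only the ids that fall in [first, first + BUCKETS_PER_PASS).
__global__ void histogramPass(const unsigned int* ids, int n, unsigned int first,
                              unsigned int* buckets, unsigned int numBuckets)
{
    __shared__ unsigned int local[BUCKETS_PER_PASS];
    for (int b = threadIdx.x; b < BUCKETS_PER_PASS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop over all points; accumulate hits in shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        unsigned int id = ids[i];
        if (id >= first && id - first < BUCKETS_PER_PASS)
            atomicAdd(&local[id - first], 1u);
    }
    __syncthreads();

    // Flush this block's non-empty partial counts to the global histogram.
    for (int b = threadIdx.x; b < BUCKETS_PER_PASS; b += blockDim.x) {
        unsigned int g = first + b;
        if (g < numBuckets && local[b] != 0) atomicAdd(&buckets[g], local[b]);
    }
}

// Hypothetical host helper: runs one pass per window of buckets.
std::vector<unsigned int> multiPassHistogram(const std::vector<unsigned int>& ids,
                                             unsigned int numBuckets)
{
    std::vector<unsigned int> result(numBuckets, 0u);
    if (ids.empty() || numBuckets == 0) return result;

    unsigned int* dIds = nullptr;
    unsigned int* dBuckets = nullptr;
    cudaMalloc(&dIds, ids.size() * sizeof(unsigned int));
    cudaMalloc(&dBuckets, numBuckets * sizeof(unsigned int));
    cudaMemcpy(dIds, ids.data(), ids.size() * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(dBuckets, 0, numBuckets * sizeof(unsigned int));

    for (unsigned int first = 0; first < numBuckets; first += BUCKETS_PER_PASS)
        histogramPass<<<64, 256>>>(dIds, (int)ids.size(), first, dBuckets, numBuckets);

    cudaMemcpy(result.data(), dBuckets, numBuckets * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dIds);
    cudaFree(dBuckets);
    return result;
}
```

Every pass re-reads the full input, so the cost grows linearly with the number of passes.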

Thanks for the suggestions, I’ll look into the interop thingy.

Processing the buckets in multiple passes is not promising, since I have about 2,000,000 buckets and can fit ~1100 of them into shared memory…
That would mean roughly 1,800 passes over the data.

I am still disappointed that there is no way to access the same functionality in a purely CUDA way.