I’m trying to speed up the process of searching all maxima inside a d-dimensional point set, and thought CUDA might be suitable to parallelize this workload.
However, I’m quite new to CUDA and not sure how to tackle this problem. Almost all beginner tutorials cover embarrassingly parallel workloads with a known result size. Are there any problems similar to this one that I can refer to? Or could anyone suggest some steps to parallelize this problem? Any tips are welcome. Thanks!
Assuming that a first processing step produces a sparse vector of results (e.g. maximal points are preserved, all other points are set to zero), a common mechanism for removing the unwanted elements from that vector is called stream compaction. The Thrust library, for example, has a built-in method for this, and there is a decent amount of literature available.
In order to make good use of modern GPUs you want parallelism on the order of O(10,000).
For the first step you mentioned: suppose my point set is very large, say 2^20 points with 10 dimensions. How should I partition/tile the points? In the solution I coded, there is a huge bottleneck both in accessing the large dataset and in storing the result into the long sparse vector.
I am not familiar with your particular use case. In my experience it is much easier for domain specialists to pick up the necessary CUDA / parallel programming knowledge than for people with knowledge of CUDA to pick up the necessary domain expertise.
In general, if there is minimal data re-use, simply rely on the massive memory bandwidth of the GPU to stream the data through. If there is significant data re-use, make use of shared memory for buffering data. That could be in the form of a two-dimensional tile, or multiple (revolving) layers of 2D tiles for a 3D stencil. I haven’t really thought about how one would best buffer a 10-dimensional hypercube; it’s not a problem I have encountered so far.
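To make the tile-buffering idea concrete, here is a hedged CPU sketch of the 1D case: each tile is copied into a small local buffer with one halo cell on each side (mirroring a CUDA thread block staging its tile plus halo in shared memory), and the stencil then reads only from that buffer. All names and the clamp-to-boundary handling are illustrative choices:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 3-point moving-average stencil processed tile by tile. Each tile is staged
// in a local buffer of size tile + 2 halo cells before the stencil reads it;
// out-of-range halo reads are clamped to the array boundary.
std::vector<float> stencil3(const std::vector<float>& in, size_t tile) {
    const long n = static_cast<long>(in.size());
    std::vector<float> out(in.size());
    for (size_t start = 0; start < in.size(); start += tile) {
        size_t end = std::min(start + tile, in.size());
        std::vector<float> buf(end - start + 2);  // tile + 2 halo cells
        for (size_t i = 0; i < buf.size(); ++i) {
            long g = static_cast<long>(start) + static_cast<long>(i) - 1;
            g = std::max(0L, std::min(n - 1, g));  // clamp halo to boundary
            buf[i] = in[static_cast<size_t>(g)];
        }
        for (size_t i = start; i < end; ++i) {
            size_t k = i - start;
            out[i] = (buf[k] + buf[k + 1] + buf[k + 2]) / 3.0f;
        }
    }
    return out;
}
```

On the GPU, the buffer fill corresponds to a cooperative load by the thread block followed by `__syncthreads()`; the payoff is that each interior element is read from slow global memory once per tile instead of three times.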
Note that if this is the only processing of the data you do, the cost of transporting the data to the GPU may be a significant part of overall processing time. It would be advantageous if surrounding processing steps also happen on the GPU, so that the data can be resident on the GPU for multiple processing steps, thus amortizing the copy cost.