I have n items and I need to compute their pairwise correlations. Each item has its own numerical data, and there are n*(n-1)/2 calculations to be made. An intuitive way to do this is to assign each calculation to its own thread, so that we have n*(n-1)/2 threads.
But is there a way to optimize this further? I think some of the calculations could make use of shared memory, since all threads involving item 1 can share item 1's data. Then I could have n-1 blocks of threads running.
This seems like quite a standard problem. Is there a standard way to optimize it in a parallel computing setting?
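Just to make the baseline concrete, here is roughly the naive version I have in mind (untested sketch; the kernel name, the row-major n x d layout, and the 2D launch that simply skips the lower triangle are just illustrative choices, not anything I've settled on):

```cpp
// Naive baseline: one thread per (i, j) pair, every thread streams both items
// straight from global memory. Layout assumption: items are rows of a row-major
// n x d float array `data`; results go into the upper triangle of an n x n matrix.
__global__ void pairwiseCorrNaive(const float *data, float *corr, int n, int d)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n || j <= i) return;      // keep only pairs with i < j

    const float *a = data + (size_t)i * d;
    const float *b = data + (size_t)j * d;

    // One pass over both items, accumulating the sums needed for the Pearson correlation.
    float sa = 0.f, sb = 0.f, saa = 0.f, sbb = 0.f, sab = 0.f;
    for (int k = 0; k < d; ++k) {
        float x = a[k], y = b[k];
        sa += x;  sb += y;
        saa += x * x;  sbb += y * y;  sab += x * y;
    }
    float cov = sab - sa * sb / d;
    float den = sqrtf((saa - sa * sa / d) * (sbb - sb * sb / d));
    corr[(size_t)i * n + j] = cov / den;

    // Note: every item gets re-read from global memory once per pair it appears in,
    // and the reads are poorly coalesced -- which is what the shared-memory idea
    // above is meant to fix.
}
```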
The solution would largely depend on the size of each item in memory, and how much calculation is involved. How many items would fit into shared memory?
For my problem, each item has about 10,000 floats, but I can make it smaller if I sacrifice some accuracy. 10,000 floats (about 40 KB) fit into the 48 KB of shared memory.
Ok, that would allow you to keep one item in shared memory while looping over all others, cutting memory bandwidth in half. That naturally implies you need to find a way to use a whole thread block (and a rather large one, to keep occupancy high) to process a single correlation pair.
I assume memory is read only once for each pair? Otherwise it would make more sense to optimize memory access patterns within each pair instead, e.g. if you compute a correlation function for each pair.
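Something along these lines is what I mean (untested sketch; I'm assuming a row-major n x d layout and, to keep it short, that each item has already been standardized to zero mean and unit norm so the correlation reduces to a dot product):

```cpp
// One block per "anchor" item i: the block stages item i in shared memory once,
// then loops over every j > i, with the whole block cooperating on each dot product.
__global__ void corrOneItemShared(const float *data, float *corr, int n, int d)
{
    extern __shared__ float s[];            // d floats for item i, plus blockDim.x scratch
    float *itemI   = s;
    float *partial = s + d;

    int i = blockIdx.x;

    // Stage item i once; it is reused for every pair (i, j), halving global traffic.
    for (int k = threadIdx.x; k < d; k += blockDim.x)
        itemI[k] = data[(size_t)i * d + k];
    __syncthreads();

    for (int j = i + 1; j < n; ++j) {
        const float *b = data + (size_t)j * d;

        // Each thread accumulates a strided slice of the dot product (coalesced reads of b).
        float acc = 0.f;
        for (int k = threadIdx.x; k < d; k += blockDim.x)
            acc += itemI[k] * b[k];
        partial[threadIdx.x] = acc;
        __syncthreads();

        // Block-wide tree reduction (blockDim.x must be a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            corr[(size_t)i * n + j] = partial[0];
        __syncthreads();
    }
}
// Launch: corrOneItemShared<<<n - 1, 256, (d + 256) * sizeof(float)>>>(data, corr, n, d);
```

The per-block workload here is unbalanced (block 0 handles n-1 pairs while the last block handles one), so in practice one block per pair may be preferable, but the staging and reduction structure would stay the same.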
Traditionally, pairwise computations can be broken down into a two-level hierarchy for GPU computing. You take your grid of pairs and bin it into coarse squares; each block then computes one coarse square. Sometimes you need a short follow-up kernel to stitch or reduce the sub-square results together.
This is a very general strategy that tends to fit the GPU well… the coarse squares are schedulable blocks with no interdependencies, yet one square by itself only accesses a limited range of data, so it tends to be more memory efficient.
The typical strategy is to pick the coarseness fine enough that you have enough blocks to easily distribute over your SMs… typically 200-20,000; fewer than 100 tends to be inefficient. But you also want the squares to be as large as possible to reduce the need to re-read memory. The maximum size of a square tends to be limited by the amount of data you can hold at once… your shared memory and registers combined.

If you're memory-bandwidth limited (by far the most common case in many, many domains) and you can break your problem into sub-blocks of m x m, you'll typically use only 1/m of the memory bandwidth of the straightforward "just read each pair" strategy, since each sub-block reads 2m items but covers m^2 pairs. You can see how nice it is if you can get m to be large, like m=16 or even m=32. Memory use per sub-block is the common limiting factor.
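As a generic illustration only (untested, and with my own assumptions: row-major n x d items, pre-standardized so the correlation is a plain dot product, and d small enough that 2*TILE items fit in shared memory), a square-per-block kernel looks roughly like this:

```cpp
#define TILE 16   // each block covers a TILE x TILE square of the pair grid

__global__ void corrTiled(const float *data, float *corr, int n, int d)
{
    extern __shared__ float s[];          // 2 * TILE * d floats
    float *rowItems = s;                  // the TILE items indexing this square's rows
    float *colItems = s + TILE * d;       // the TILE items indexing its columns

    int i0 = blockIdx.y * TILE;
    int j0 = blockIdx.x * TILE;
    if (j0 + TILE <= i0) return;          // square lies entirely below the diagonal

    // Stage 2*TILE items cooperatively; they get reused for TILE*TILE pairs,
    // so global traffic drops by roughly a factor of TILE versus pair-by-pair reads.
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < TILE * d;
         idx += blockDim.x * blockDim.y) {
        int item = idx / d, k = idx % d;
        rowItems[idx] = (i0 + item < n) ? data[(size_t)(i0 + item) * d + k] : 0.f;
        colItems[idx] = (j0 + item < n) ? data[(size_t)(j0 + item) * d + k] : 0.f;
    }
    __syncthreads();

    // One thread per pair inside the square.
    int i = i0 + threadIdx.y, j = j0 + threadIdx.x;
    if (i < n && j < n && j > i) {
        float acc = 0.f;
        for (int k = 0; k < d; ++k)
            acc += rowItems[threadIdx.y * d + k] * colItems[threadIdx.x * d + k];
        corr[(size_t)i * n + j] = acc;
    }
    // Launch: dim3 block(TILE, TILE), grid((n+TILE-1)/TILE, (n+TILE-1)/TILE),
    // shared memory = 2 * TILE * d * sizeof(float)  (so d <= ~384 with 48 KB).
}
```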
Now, this very vague advice may not apply to your problem. If a single data item takes 40 KB just by itself… then you can't even easily load two at once, much less 256, so it seems possible you're going to have to stream item data in from device memory for every one of the N^2/2 pairs… sounds expensive!
But your problem is still so vaguely described ("I have many pairwise computations and each item is 40 KB in size") that it's impossible to give better strategic advice without more details.
All computations are independent. Each pair of 10,000-float items is fed into an EM algorithm to obtain that pair's answer.
While the number of floats per item can be reduced to 5,000 or even 2,500 by sacrificing some accuracy, I doubt the 48 KB of shared memory can hold many items at once. So it seems the best I can do is to hold only one item's data in shared memory.
Since you appear to be targeting a Fermi GPU, I'd be curious whether the L2 cache would help broadcast an "item" (the 10,000 floats) to many multiprocessors. This would rely on the scheduler issuing blocks in some regular order (why wouldn't it? :) ), so that the first N-1 blocks would all be comparing item 0 to items 1…N-1. (I am assuming you would launch one block per item pair.) This way, item 0 has the potential to be fetched from the L2 cache rather than global memory. In fact, for maximum benefit, you might want to use some inline PTX to force the second item of the pair to always bypass the L2 cache, to avoid pushing the first item out prematurely.
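For the PTX part, something like this helper is what I have in mind (a sketch, untested; strictly speaking the .cs qualifier marks the line "evict first" in L1/L2 rather than hard-bypassing L2, but it should serve the same goal of not pushing item 0 out):

```cpp
// Streaming load: hint that this line should be evicted first, so reads of the
// second item don't displace item 0 from L2. Assumes 64-bit device pointers
// (use the "r" constraint instead of "l" for a 32-bit build).
__device__ __forceinline__ float loadStreaming(const float *p)
{
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Inside the per-pair loop: read item 0 normally (so it can be broadcast from L2)
// and read the second item through loadStreaming().
```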
So you run an EM iteration over each pair of items? That would really make it worthwhile to keep each block's pair of items in on-chip memory. Luckily, with one item in shared memory, the L2 cache (768 KB) is large enough to hold the other item for each of the ~15 concurrently running blocks (15 × 40 KB = 600 KB). Seibert's comment about bypassing the L2 cache, applied to the item that is loaded into shared memory, seems quite important then.
If you could reduce each item to 8,000 floats, you could also place half of the second item in shared memory, hoping that the other half would be served from the L1 cache. That would keep the data for iterating over one pair completely within the SM, reducing even the on-chip traffic further.
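In code, that 8,000-float layout would look roughly like this (untested; again assuming standardized items so the correlation is a dot product, one block per pair, and a zero-initialized corr array since I use atomicAdd instead of spelling out the block reduction again):

```cpp
__global__ void corrSplitShared(const float *data, float *corr, int n, int d /* 8000 */)
{
    extern __shared__ float s[];          // d + d/2 floats ~= 47 KB for d = 8000
    float *itemI     = s;                 // the full first item
    float *itemJHalf = s + d;             // the first half of the second item

    int i = blockIdx.y, j = blockIdx.x;   // one block per pair, upper triangle only
    if (j <= i) return;

    for (int k = threadIdx.x; k < d; k += blockDim.x)
        itemI[k] = data[(size_t)i * d + k];
    for (int k = threadIdx.x; k < d / 2; k += blockDim.x)
        itemJHalf[k] = data[(size_t)j * d + k];
    __syncthreads();

    // First half of item j comes from shared memory; the second half goes through
    // L1/L2, which should stay resident since the block touches little else.
    float acc = 0.f;
    for (int k = threadIdx.x; k < d / 2; k += blockDim.x)
        acc += itemI[k] * itemJHalf[k];
    for (int k = d / 2 + threadIdx.x; k < d; k += blockDim.x)
        acc += itemI[k] * data[(size_t)j * d + k];

    // corr must be zeroed beforehand; a proper block reduction would be cheaper
    // than atomics but is omitted to keep the sketch short.
    atomicAdd(&corr[(size_t)i * n + j], acc);
}
```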
Let me suggest a higher-level viewpoint: computing all pairwise correlations is essentially computing a matrix product. If each item is a column of a matrix X (with each column standardized to zero mean and unit norm), then the pairwise correlation matrix is just X' * X (where X' denotes the matrix transpose).
Hence I think your first approach should be to follow the usual matrix-matrix multiply paradigm, as explained in the CUDA programming guide (and a bunch of other places): break the computation into blocks of B x B (e.g. 16 x 16); within each block, load the data into shared memory, compute the partial answers, then move on to the next block.
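If you just want correlations in the X' * X sense (rather than fusing in the per-pair EM iterations), the quickest sketch is to let cuBLAS do the blocked multiply for you. Here I'm assuming X is stored column-major as d x n on the device, with each column already centered and scaled to unit norm:

```cpp
#include <cublas_v2.h>

// C (n x n) = X^T (n x d) * X (d x n) -- the full pairwise correlation matrix.
void correlationMatrix(cublasHandle_t handle,
                       const float *dX,   // device pointer, d x n, column-major
                       float *dC,         // device pointer, n x n output
                       int n, int d)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, n, d,
                &alpha, dX, d, dX, d,
                &beta,  dC, n);
    // cublasSsyrk() could roughly halve the work by exploiting the symmetry of C.
}
```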
There are of course many, many optimization possibilities, some of which other people have touched on. Check out Volkov and Demmel's paper "Benchmarking GPUs to tune dense linear algebra" if you really want to dig in, though Fermi offers some new possibilities.
Since you are computing O(n^2) values, the algorithm will naturally require O(n^2) work, but it should parallelize very nicely.
On the other hand, if you are only interested in, say, finding the most correlated pair, or least correlated pair, then you might be able to reduce the total amount of work with a data structure of some sort.