Can't really find much about this through a superficial search, so I decided to post the method I'm implementing in my current app.

Many-to-many calculation usually arises when you have many data points and need to run a computation against each and every other data point. In my app I'm computing string sequence distances for clustering.

A natural fit for this is a 2D thread block: you have your sequences along the left and the top, with every thread doing a single comparison. The problem with this, however, is that you end up comparing every pair of data points twice, as well as comparing every data point to itself. Since I have no plans to launch idle threads, I decided to work out a nice way to map a 1D thread index onto the upper triangle of the grid. I needed a closed-form solution, since I wanted to avoid loops and if statements as much as possible.

So, some sequence analysis later, and I see that the linear index i of each row-start pair (x, 0) conforms to i = (x^2 - x - 2)/2.

Some maths later, I invert it and solve for x and y:
x = floor(sqrt(2i + 9/4) + 1/2)
y = i - (x^2 - x)/2 + 1

Those should be simple enough to implement on a GPU. There's a sqrt in there, but the accuracy is not important, so the faster version can be used.
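To illustrate, here is the mapping sketched in Python, purely as a host-side check of the formulas above (an actual kernel would do the same arithmetic with `sqrtf`/`floorf`); the asserts confirm that every 1D index lands on a distinct pair on one side of the diagonal:

```python
import math

def unpack_index(i):
    """Map a 1D thread index i to a pair (x, y) in the triangle,
    using the closed-form inverse from the post."""
    x = math.floor(math.sqrt(2 * i + 9 / 4) + 1 / 2)
    y = i - (x * x - x) // 2 + 1  # (x^2 - x)/2 is always an integer
    return x, y

# Sanity check: every index yields a distinct pair with y < x,
# so no pair is computed twice and nothing is compared to itself.
pairs = [unpack_index(i) for i in range(1000)]
assert len(set(pairs)) == len(pairs)      # all pairs distinct
assert all(0 <= y < x for x, y in pairs)  # strictly one side of the diagonal
```

The first few indices come out as (2, 0), (2, 1), (3, 0), (3, 1), (3, 2), (4, 0), … - each row x contributing one more pair than the last.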

I can't believe that many-to-many computation is all that uncommon, so if I've missed a better way to do it, please let me know.

I see nothing wrong with your approach, particularly as the (lower precision) sqrt operation is quite fast on the GPU. In case you wonder what one may do if no floating point were available, I’ll refer you to my hastily created GIMP artwork:

This happens in O(N^2) n-body calculations, which have been studied extensively on the GPU - but for them it's usually faster to just do the double computation. There was an article in GPU Gems 3 about this.

I was also thinking about some way to rectangularize the indexes, but so far I have been unable to come up with an elegant algorithm. It could be that I am unreasonably reluctant to use branches. It might end up faster than a formulaic approach if done properly, but I have prioritized simple calculation in this case.

Well, I guess lookups would be faster if you can keep to the cached section of constant memory, but if you go over, or you need that constant memory for something else, it might be a bad idea. The upper limit is pretty much down to precision, I guess? My one worry is the fast, lower-accuracy sqrt(). The algorithm automatically fails if, say, sqrt(25) returns 4.99999 or sqrt(24) returns 5. Accuracy beyond that is not important, since the result gets floored afterwards anyway, but it needs to hit the right integer. I am unsure how high the indexes have to be before sqrt() accuracy poses a problem, or whether that even happens before you hit overflow on 16/32-bit numbers. If you want a higher upper limit than 1D addressing allows, you could try something like tera suggests, since it allows 2D thread addressing, with an if flipping the one section.
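One can probe exactly this worry on the host. The dangerous inputs are the row boundaries i = (x^2 - x - 2)/2, where 2i + 9/4 equals (x - 1/2)^2 exactly, so a correctly rounded sqrt cannot miss there - but an approximate intrinsic might. A rough check in Python, with numpy's float32 standing in for single-precision device arithmetic (an assumption: the GPU's fast sqrt path may round differently, so the real test would be the same probe run on the device):

```python
import math
import numpy as np

def unpack_exact(i):
    # Reference mapping in double precision (exact at these magnitudes).
    x = math.floor(math.sqrt(2 * i + 9 / 4) + 0.5)
    return x, i - (x * x - x) // 2 + 1

def unpack_f32(i):
    # Same mapping, but with the sqrt argument and result held in
    # float32, mimicking single-precision device arithmetic.
    s = np.sqrt(np.float32(2 * i) + np.float32(2.25))
    x = int(np.floor(s + np.float32(0.5)))
    return x, i - (x * x - x) // 2 + 1

# Probe the row-boundary indices, where a sqrt that is off by one ulp
# would flip the result into the wrong row.
for x in range(2, 2000):
    i = (x * x - x - 2) // 2
    assert unpack_f32(i) == unpack_exact(i), (i, x)
```

With IEEE-style rounding this passes comfortably in that range; it says nothing yet about a hardware approximation, which is the case that would actually need measuring.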

VERY interesting reading. I'm unsure if it's of use to me, since the results of my GPU computation get sent back to the CPU to perform the actual clustering (something done best there), while this seems reliant on an accumulation buffer. If you have some basic source code, I would love to browse it.

One worry I had with this is occupancy. So I calculated the rows-by-columns needed to make the number of threads a multiple of 32.
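I read that as rounding the pair count up to whole warps. A sketch under that assumption (the exact rows-by-columns scheme isn't spelled out here, and the block size of 256 is just an example; the padding threads would simply return early in the kernel):

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA hardware

def launch_size(n_points, block=256):
    """Round the n*(n-1)/2 pairwise comparisons up to whole warps,
    then work out how many blocks that launch needs."""
    total = n_points * (n_points - 1) // 2
    padded = (total + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE
    blocks = (padded + block - 1) // block
    return total, padded, blocks
```

For example, 100 points give 4950 pairs, padded to 4960 threads, i.e. 20 blocks of 256.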

The method I cited can be used to make sure every thread gets the same workload, without using sqrt() etc. Much simpler and more elegant, I thought.

I'm looking for "basic code". I used this some time ago on the OpenCL N-body example. In the end I had to conclude that the straightforward asymmetric approach, which calculates interactions twice, was still 30% faster than my symmetric implementation. I still want to get back to this, as I cannot accept that as unavoidable. So I moved to CUDA instead of OpenCL, because there I can debug, and I hoped that would help me find the problem and a cure; instead I got a bit lost, but am still learning.

I recommend http://www.mcs.anl.gov/~itf/dbpp/text/node5.html (start wherever you want, going backward or forward). At times, it is deceptively simple but it does appear to get to the bottom of designing parallel algorithms.

Not needing accumulation makes things easier, as you can write a key for each (many-to-many) combination. I'm interested in your experiences; if you can keep us posted, that would be great.

If workload splitting were the only issue, perhaps. In my case, though, each comparison is independent of the others: either a pass or a fail that needs to be reported to the CPU. That independence means no real accumulation. The address offset of the result is the 'key', which gives added value to having a single index represent a single comparison. Balancing the workload is actually the trickier part for me, since each comparison's computation time depends on the lengths of the two strings being compared, so if a single long string sneaks in, I assume it slows the whole warp down to the computation time of the longest comparison. That's undesirable, but it can be mitigated by sorting; still running experiments.
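The sorting mitigation can be sketched quickly: reorder the sequences by length before pairing, so that neighbouring indices - and hence the pairs handled by threads in the same warp - tend to involve similarly sized strings, rather than one long outlier stalling a whole warp. A minimal host-side sketch (the distance kernel itself is not shown, and the returned permutation is a hypothetical helper for mapping results back to the original ids):

```python
def order_for_warps(sequences):
    """Sort sequences by length so that adjacent indices, and hence
    the comparisons handled by adjacent threads, have similar string
    lengths, reducing in-warp imbalance from one long outlier."""
    order = sorted(range(len(sequences)), key=lambda k: len(sequences[k]))
    # Keep the permutation so results can be mapped back afterwards.
    return [sequences[k] for k in order], order
```

Whether this pays off in practice is exactly what the experiments mentioned above would have to show.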
