I am going to implement adaptive resonance theory (ART) on CUDA. The computational core of the algorithmic version of ART is a search for the best-matching template for a given input. The match score is a function of three ratios, and computing these ratios is where I hope CUDA can speed up my application. The ratios are:
the dot product of the input vector with itself (not really important, since it is computed only once per input, not once per template);
the dot product of a template vector with itself;
the sum of the element-wise minima of the input and the template, i.e. for input i and template t: min(i1, t1) + min(i2, t2) + min(i3, t3) + …
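For reference, here is a plain C baseline of the two per-template quantities (all function and variable names are my own; this is just the sequential version I want to move to the GPU):

```c
/* Sequential baseline for the per-template quantities.
   dot(t, t, n) is the template's dot product with itself;
   min_sum(i, t, n) is the sum of element-wise minima. */
static float dot(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int j = 0; j < n; j++)
        s += a[j] * b[j];
    return s;
}

static float min_sum(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int j = 0; j < n; j++)
        s += (a[j] < b[j]) ? a[j] : b[j];
    return s;
}
```

The expensive part is calling both of these for every one of the several hundred templates, for every input.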
We can assume that each input will need to be checked against several hundred (or more) templates. Done sequentially, this is an expensive double loop: over every vector element, for every template.
My main question is: what would be a good method to do the element-wise min and sum? I would like to figure out the best use of threads, blocks, and memory types for this problem. Any advice would be welcome! Please let me know if I can describe the problem better.
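To make the question concrete, here is a rough, untested sketch of the layout I have been considering: one block per template, with each thread striding over the vector elements and a shared-memory tree reduction to produce the min-sum and the template self-dot-product. All names are mine, and I would be glad to hear whether this is even a sensible starting point.

```cuda
// Sketch (untested): launch with one block per template, e.g.
//   art_match<<<T, 256, 2 * 256 * sizeof(float)>>>(...);
// Assumes blockDim.x is a power of two.
__global__ void art_match(const float *input,      // [N], same for every block
                          const float *templates,  // [T * N], row-major
                          float *min_sums,         // [T] out: sum of element-wise mins
                          float *t_dots,           // [T] out: template . template
                          int N)
{
    extern __shared__ float sdata[];               // 2 * blockDim.x floats
    float *s_min = sdata;
    float *s_dot = sdata + blockDim.x;

    const float *t = templates + blockIdx.x * N;   // this block's template
    float my_min = 0.0f, my_dot = 0.0f;

    // Each thread accumulates a strided slice of the vector.
    for (int j = threadIdx.x; j < N; j += blockDim.x) {
        float tv = t[j];
        my_min += fminf(input[j], tv);
        my_dot += tv * tv;
    }
    s_min[threadIdx.x] = my_min;
    s_dot[threadIdx.x] = my_dot;
    __syncthreads();

    // Standard shared-memory tree reduction of both sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            s_min[threadIdx.x] += s_min[threadIdx.x + s];
            s_dot[threadIdx.x] += s_dot[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        min_sums[blockIdx.x] = s_min[0];
        t_dots[blockIdx.x]   = s_dot[0];
    }
}
```

I was thinking the input vector might also belong in constant or shared memory, since every block reads all of it, but I am not sure what the best choice is.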
I would also be happy to give more details on ART; I tried to give the bare minimum needed for my computation problem. There is a decent article, with more resources, on Wikipedia for anyone who just wants more general info on how the rest of ART works.