Adaptive Resonance on CUDA

I am going to implement adaptive resonance theory (ART) on CUDA. The core (in the sense of computational complexity) of the algorithmic version of ART is a search for the best matching template for a given input. The match score is a function of three ratios. Computation of these ratios is where I hope CUDA can help speed up my application. The ratios are:
- the dot product of the input vector with itself (not really important, as it's only done once for all the templates)
- the dot product of a template vector with itself
- the sum of the element-wise min of the input and the template. I.e. for input i and template t: min(i1, t1) + min(i2, t2) + min(i3, t3) + …
We can assume that each input will need to be checked against several hundred (or more) templates. This is an expensive double loop (each vector element for each template) when done sequentially.
My main question is what would be a good method to do the element-wise min and sum? I would like to figure out the best use of threads and blocks and memory types for this problem. Any advice would be welcome! Please let me know if I can describe the problem better.
I would also be happy to give more details on ART; I tried to give the bare minimum needed for my computation problem. There is a decent article, with more resources, on Wikipedia if anyone just wants more general info on how the rest of ART works.


So, for all x compare to all y, is that correct? If so, you can keep all your x loaded in thread registers, read a chunk of y into shared memory, and let all your x compare against all y starting from y[0]. This takes advantage of the broadcast mechanism in shared memory, which lets all threads read the same location simultaneously, and gives you a speedup limited only by the combined bandwidth and/or computational power.

Thanks, this is exactly the sort of pointer (no pun intended) I was looking for. Let me make sure that we have the same view of the problem though. I think what you said will apply, I’m just not sure I can fit my problem into “for all x compare to all y.”

Let x be an input vector and y a vector that we want to compare with x; there are many different ys but only one x. We want to compare the one vector x to all vectors y. The first element of x will be compared to the first element of every y. So, based on your suggested approach, I may be able to take advantage of the same x[1] being compared to many different y[1]s.

Considering one parallel set of comparisons, I could load each y[1] (from the many different ys) into a different thread register, then load x[1] into shared memory. Then, all y[1]s would be able to compare with x[1] simultaneously. Repeat for x[2] and y[2]s, etc.

Hopefully the registers and shared memory will be big enough that I can load complete vectors rather than each element separately.
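If it helps to make that concrete, the layout I have in mind might look roughly like this as a CUDA kernel: one thread per template, with x staged in shared memory so each read of x[k] is a broadcast to every thread in the block. All names, sizes, and the flat row-major template layout are my own assumptions for the sketch, not settled design:

```cuda
#define DIM 64  // vector length (illustrative, assumed <= shared capacity)

__global__ void per_template_kernel(const float *x,          // [DIM]
                                    const float *templates,  // [numTemplates][DIM], row-major
                                    float *minSums,
                                    int numTemplates)
{
    __shared__ float xs[DIM];
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // template index

    // cooperatively load x into shared memory once per block
    for (int k = threadIdx.x; k < DIM; k += blockDim.x)
        xs[k] = x[k];
    __syncthreads();

    if (t < numTemplates) {
        float sum = 0.0f;
        for (int k = 0; k < DIM; ++k) {
            float yv = templates[t * DIM + k];  // note: uncoalesced as written
            sum += (xs[k] < yv) ? xs[k] : yv;   // xs[k] is a broadcast read
        }
        minSums[t] = sum;
    }
}
```

One caveat on the sketch: with consecutive threads handling consecutive templates, the reads of templates[t * DIM + k] are strided and won't coalesce; transposing the template storage (element-major) would fix that.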

TIA for any comments, suggestions, etc.

So more like: do r += strcmp( x , y[i++] ) while( i < max) ?

This would be

#define tx threadIdx.x

// loop body
shared[tx] = (x[tx] < y[i][tx]) ? x[tx] : y[i][tx];

Finish by summing up shared… The number of registers and the size of shared memory are in the programming guide; now would be a good time to open it up. On the surface it sounds like your project fits the architecture very nicely.
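Putting the loop body and the final sum together, a sketch of that kernel might look like the following: one block per template, one thread per vector element, with a tree reduction in shared memory for the sum. I'm assuming DIM is a power of two and equal to blockDim.x; all names are illustrative:

```cuda
#define DIM 256  // vector length, assumed == blockDim.x and a power of two

__global__ void min_sum_kernel(const float *x,          // [DIM]
                               const float *templates,  // [numTemplates][DIM], row-major
                               float *minSums)          // [numTemplates]
{
    __shared__ float s[DIM];
    int tx = threadIdx.x;
    const float *y = templates + blockIdx.x * DIM;  // this block's template

    // element-wise min of x and this template
    float xv = x[tx];
    float yv = y[tx];
    s[tx] = (xv < yv) ? xv : yv;
    __syncthreads();

    // tree reduction: sum the DIM partial minima
    for (int stride = DIM / 2; stride > 0; stride >>= 1) {
        if (tx < stride)
            s[tx] += s[tx + stride];
        __syncthreads();
    }

    if (tx == 0)
        minSums[blockIdx.x] = s[0];
}

// launch: one block per template
// min_sum_kernel<<<numTemplates, DIM>>>(d_x, d_templates, d_minSums);
```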