I’ve only been working in CUDA for about a week now and I’d like some help thinking about a problem. I can post more details later if necessary, but the gist of what I am trying to do is this:

Given an input array i of 1024 ints, calculate d[x][y] = |i[x] - i[y]| for every x and y. Output an array o of 1024 ints where o[x] = y, where y is the index in i that minimizes d[x][y] (with x != y). Does that make sense?

As an example, if i = [3, 5, 6, 3], then o = [3, 2, 1, 0].

When the answer is ambiguous because d[x][y0] and d[x][y1] are equal and less than all other distances for that x, it doesn't matter whether o[x] is y0 or y1.

Naturally I'm trying to maximize throughput. Also, I am using a compute capability 1.1 device. Do any of you have pointers on ways I can maximize efficiency? Or is anyone aware of a domain where this problem is solved, so I can do some reading? Thanks!