Algorithm query...

Hi, I’m new to CUDA, so pls. excuse the ‘newbie’ query. I’m trying to decide on the best way to parallelise this problem.

Suppose I have two sets of data, these are essentially key/value pairs. The keys are almost always the same,
but I can have keys that may be in one set and not the other:

a = [ (“1D”, 2.1) ( “2D”, 2.2), (“1M”, 2.4) ]

b = [ (“1D”, 2.2) ( “2D”, 2.3), (“6M”, 3.1) ]

The strings “1D” etc… are atomised, so are stored as ints.

I can have large input sets, hence the goal to put on the GPU and parallelise an (a-b).

The output would be a-b, would be: (“1D”,-0.1), (“2D”, -0.1), (“1M”, 2.4), (“6M”, -3.1)

Any ideas on the best approach to this problem? In particular I’m not sure how (or if I can) parcel up sections of the list for each thread to work on …

Its basically parallelising this bit of F# code:

let irdelta =
let tenors = union (List.unzip a |> fst) (List.unzip b |> fst)
let a’ = Map.ofList a
let b’ = Map.ofList b tenors ( (fun x -> overlay (a’.TryFind x) (b’.TryFind x) ) tenors)

i.e. Is there an efficient way around createing the maps, key - value pairs??

How large can the large input datasets be ?
and how fast do you need it to be ?
is the order of the output important ?

Looking at how fast the n-body example on the SDK is an approach where, a thread is assigned for every key in a and then that thread searches every value in b to see if it matches, would work for lists that are tens of thousands of elements. (repeat with a thread for every element in b

Another way that comes to mind is to merge a and b into a single list and sort it using one of the sorts designed to run well in parallel, then with as many threads (total) as the length of the merged list have each thread check if its key is the same as either of its neighbours.

This sounds like a segmented scan. The thrust template library has a suite of scan and prefix operators that you might be able to adapt with only a handful of lines of extra code.

Thanks very much for both the suggestions.

Just to follow up, thanks very much for the tip on thrust and also the tip on combining into a single list …

I stumbled across this, which more or less shows exactly how to solve my problem in a tiny amount of code …

const int N = 7;

int A[N] = {1, 3, 3, 3, 2, 2, 1}; // input keys

int B[N] = {9, 8, 7, 6, 5, 4, 3}; // input values

int C[N]; // output keys

int D[N]; // output values

thrust::pair<int*,int*> new_end;

thrust::equal_to binary_pred;

thrust::plus binary_op;

new_end = thrust::reduce_by_key(A, A + N, B, C, D, binary_pred, binary_op);

That said, I think the idea of using a thread per element was useful, made me realise … I’m not using a CPU, a GPU, I can have hundreds of threads.