Algorithm Strategy brainstorming… Please help me choose the best algo…

Hi all…

I have a problem which I am trying to solve on CUDA… it's related to calculating space trajectories for interplanetary stuff…

The problem is as follows:

Any bold symbol represents a float4 vector.

Say I have 30720 (just for this example) pairs of position vectors r1, r2, plus a bunch of other float values (30720 of each) that go along with each of these pairs.

Now there are two function calls…

The first call takes in a pair r1, r2 and its associated values and calculates a parameter called n. If this parameter is greater than nmax, it is clamped to nmax (a user-set value stored in constant memory).

Now we define a parameter i which ranges from 0 to 2*n+1

The second function takes in the current r1 and r2 and the i parameter and calculates V1 and V2, which are basically the velocities of the spacecraft. Since i ranges from 0 to 2n+1, for a given r1 and r2 we can have at most 2*nmax+1 values of V1 and V2, and these can be computed in parallel.

In dummy code it looks like this…

repeat 30720 times (once per pair):

    FUNCTIONA(r1, r2, bunch of other values, n)

    if (n > nmax) n = nmax;

    for (i = 0; i < 2*n + 1; i++) {
        FUNCTIONB(r1, r2, bunch of other values, i, V1, V2)
    }

The above code repeats for each of the 30720 pairs, so it is fully parallel across pairs. But as we can notice, there is one more level of parallelism we can exploit: given the n value after FUNCTIONA, we can evaluate the 2n+1 calls to FUNCTIONB in parallel as well. So at maximum the parallelism is roughly 30720*nmax. But since the n value is different for each pair of r1 and r2, the actual parallelism will be less, and we cannot know it before we execute FUNCTIONA on each pair.
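Just so the mapping is concrete, a minimal CUDA sketch of this one-thread-per-pair version is below. The FUNCTIONA/FUNCTIONB signatures, the single "extra" parameter, and the dummy bodies are only placeholders so the sketch compiles; the real device functions are the trajectory math.

    #include <cuda_runtime.h>

    __constant__ int d_nmax;                      // user-set nmax, as described above

    // Placeholder device functions standing in for the real FUNCTIONA/FUNCTIONB
    // (the real bodies are the trajectory math, which isn't shown here):
    __device__ int FUNCTIONA(float4 r1, float4 r2, float extra)
    {
        return (int)extra;                        // dummy body: pretend extra encodes n
    }

    __device__ void FUNCTIONB(float4 r1, float4 r2, float extra, int i,
                              float4* v1, float4* v2)
    {
        *v1 = r1; *v2 = r2;                       // dummy body
    }

    // One thread per (r1, r2) pair; the 2n+1 inner evaluations stay serial in each thread.
    __global__ void solvePairs(const float4* r1, const float4* r2, const float* extra,
                               float4* V1, float4* V2, int numPairs)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= numPairs) return;

        float4 a = r1[tid];                       // coalesced: thread k reads element k
        float4 b = r2[tid];
        float  e = extra[tid];

        int n = FUNCTIONA(a, b, e);
        if (n > d_nmax) n = d_nmax;               // clamp, as in the dummy code

        int stride = 2 * d_nmax + 1;              // fixed-size output slot per pair
        for (int i = 0; i < 2 * n + 1; ++i)
            FUNCTIONB(a, b, e, i, &V1[tid * stride + i], &V2[tid * stride + i]);
    }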

Now, another solution strategy I came up with: instead of doing the whole process above with 30720 threads in one kernel launch, I can do two kernel launches…

kernel A = launches 30720 threads and evaluates FUNCTIONA… it stores delta = 0 or 1 in an array of length 30720*nmax, marking which of each pair's nmax slots actually fall within that pair's n

kernel B = launches 30720*nmax threads and does FUNCTIONB, but now not all threads will compute… some threads will be left idle where delta[tid] = 0… I suspect up to 20% of the threads might be idle in some cases, but now we do have much more parallelism to exploit. There is a catch with this approach though… now several threads (all those assigned to the same pair) have to read the same value of r1 and r2 from global memory, which will destroy the coalescing as I see it… I can't find a way to preserve coalescing and still achieve this much parallelism…
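Roughly, the two-kernel layout I have in mind looks like the sketch below (again with placeholder signatures; I have used nmax slots per pair as described above, so swap in 2*nmax+1 if that is the slot count you actually need):

    #include <cuda_runtime.h>

    __constant__ int d_nmax;

    // Stand-ins for the real device functions (same placeholders as in the sketch above):
    __device__ int  FUNCTIONA(float4 r1, float4 r2, float extra) { return (int)extra; }
    __device__ void FUNCTIONB(float4 r1, float4 r2, float extra, int i,
                              float4* v1, float4* v2) { *v1 = r1; *v2 = r2; }

    // Kernel A: one thread per pair; computes n, clamps it, and flags which of the
    // pair's nmax slots are active (delta = 1) or idle (delta = 0).
    __global__ void kernelA(const float4* r1, const float4* r2, const float* extra,
                            int* n_out, unsigned char* delta, int numPairs)
    {
        int pair = blockIdx.x * blockDim.x + threadIdx.x;
        if (pair >= numPairs) return;

        int n = FUNCTIONA(r1[pair], r2[pair], extra[pair]);
        if (n > d_nmax) n = d_nmax;
        n_out[pair] = n;

        for (int s = 0; s < d_nmax; ++s)          // flag this pair's slots
            delta[pair * d_nmax + s] = (s < n) ? 1 : 0;
    }

    // Kernel B: one thread per slot (numPairs * nmax threads in total); a thread whose
    // slot is flagged idle just returns. All the threads of one pair re-read the same
    // r1/r2 from global memory, which is the uncoalesced access pattern in question.
    __global__ void kernelB(const float4* r1, const float4* r2, const float* extra,
                            const unsigned char* delta,
                            float4* V1, float4* V2, int numPairs)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= numPairs * d_nmax) return;
        if (delta[tid] == 0) return;              // idle thread

        int pair = tid / d_nmax;                  // which pair this slot belongs to
        int i    = tid % d_nmax;                  // which index within the pair

        FUNCTIONB(r1[pair], r2[pair], extra[pair], i, &V1[tid], &V2[tid]);
    }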

So I was wondering if any of you super brains can help me out here… on the best way to port this algorithm to the GPU? I would really appreciate any input from anyone :-)

Feel free to ask if anything about the algorithm is confusing…

Thanks,

Nitin

Hi, hope you remember me. From my understanding, the main problem you're trying to solve is to balance the work each thread does, given the different amount of computation each of the 30720 elements requires. If the amount of work element i does is i (which happens if you want to evaluate some function over all pairs of a set), then you could group them into pairs (0, n-1), (1, n-2), … to achieve equal work, but that doesn't apply here.

One way to divide the work equally for the second computation stage is to build an indexing structure where element i tells you which of the top level elements it belongs to:

n for the top-level elements (30720 of them):
    5   5   4   9   5   8   ...

prefix sum:
    0   5   10  14  23  28  36

indexing structure (size M = total amount of work):
    0 0 0 0 0   1 1 1 1 1   2 2 2 2   3 3 3 3 3 3 3 3 3   4 4 4 4 4   5 5 5 5 5 5 5 5

Then for the 2nd stage, each thread will process elements tid + thread_count * k, where tid is the global thread id, thread_count is the global thread count, and k is any non-negative integer (I'm using k so it doesn't clash with the n above). This is just a grid-stride loop.

This should achieve coalescing (<= 2 transactions/warp load) for compute capability >= 1.2. Threads within a warp will read the same address, but there’s not much you can do about that.
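A sketch of that second-stage kernel, where pair_of[] is the indexing structure, offset[] is the prefix sum, and M is the total amount of work (the names and the FUNCTIONB stand-in are mine, not your actual code):

    // Stand-in for the real device function:
    __device__ void FUNCTIONB(float4 r1, float4 r2, float extra, int i,
                              float4* v1, float4* v2) { *v1 = r1; *v2 = r2; }

    // Stage 2 with the index table: pair_of[j] tells which top-level element slot j
    // belongs to, offset[] is the prefix sum, and M is the total amount of work.
    __global__ void stage2(const int* pair_of, const int* offset,
                           const float4* r1, const float4* r2, const float* extra,
                           float4* V1, float4* V2, int M)
    {
        int stride = blockDim.x * gridDim.x;
        // grid-stride loop: thread tid handles slots tid, tid + stride, tid + 2*stride, ...
        for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < M; j += stride) {
            int p = pair_of[j];                   // coalesced read of the index table
            int i = j - offset[p];                // position of this slot within pair p
            FUNCTIONB(r1[p], r2[p], extra[p], i, &V1[j], &V2[j]);
        }
    }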

Of course, the real challenge becomes how to compute the index table efficiently. One way is a binary search of the prefix sums, but the thread divergence will probably kill performance unless you find a way to use predicated execution so that all threads iterate lg(30720) times.
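For example, a fixed-iteration search along these lines (just a sketch; it assumes prefix[] has numPairs + 1 entries ending with the total):

    // Fixed-iteration binary search over the prefix sums: returns the largest p with
    // prefix[p] <= j. Every thread runs the same 15 iterations (2^15 > 30720), so the
    // only divergence left is the data-dependent comparison, which the compiler can
    // turn into a select/predicated instruction.
    __device__ int find_owner(const int* prefix, int numPairs, int j)
    {
        int lo = 0, hi = numPairs;                // invariant: prefix[lo] <= j < prefix[hi]
        for (int it = 0; it < 15; ++it) {
            int mid = (lo + hi) / 2;
            if (prefix[mid] <= j) lo = mid;
            else                  hi = mid;
        }
        return lo;
    }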

The main question I have is: what's the average size of n for the top-level elements, and how evenly distributed are they? If they're relatively evenly distributed, then splitting the processing into blocks of NMAX and having idle threads like you said might be better than paying the cost of computing the index table I described to achieve equal distribution.

Thanks very much… yes, I do remember you… fellow yellow jacket :)

I tried my above-mentioned method and it works faster than the one you mentioned, mainly because the n value for each thread varies a lot and is not evenly distributed, as you pointed out. But I got one more idea: stream compaction. I will look into that and get back to you.

I'm also considering removing the second level of parallelism, since then I have a lot fewer global memory reads/writes and it may be beneficial overall.

That doesn't make sense. If n varies a lot, then my method or stream compaction (I assume you mean sorting the top-level elements according to n so that a warp can process identically sized elements without divergence) would work better. Did you mean n doesn't vary a lot?
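If it helps, that sorting step is easy to prototype with Thrust. A rough sketch, assuming d_n already holds the clamped n values from the first kernel (the array names are mine):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/sequence.h>

    // Sort a permutation of the pair indices by each pair's n, so that pairs with
    // similar amounts of work end up adjacent and warps stay mostly uniform.
    // Note that d_n itself also gets sorted in place here.
    void sort_pairs_by_n(thrust::device_vector<int>& d_n,      // size numPairs
                         thrust::device_vector<int>& d_perm)   // size numPairs (output)
    {
        thrust::sequence(d_perm.begin(), d_perm.end());        // 0, 1, 2, ...
        thrust::sort_by_key(d_n.begin(), d_n.end(), d_perm.begin());
        // The second kernel can then fetch its pair through d_perm[...] instead of
        // using the thread index directly.
    }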

Well, you were right (I had a bug which showed otherwise previously)… stream compaction and your method work better (roughly a 15-20% improvement). But in the actual implementation, my computation time, even without the second level of parallelism, is much less than the two-way memory copy time. So I am focusing on overlapping computation with the transfers using streams. Once I am done with that… I will benchmark both methods…

thanks for the help :rolleyes:

OK. That makes sense. By streaming, I assume you mean uploading, computing, and downloading blocks of the problem with pipeline parallelism. I tried that for a pattern recognizer, where the computation was cheap compared to the upload, and I found you need at least about 10 ms of work in flight before the cost of the extra kernel launches is amortized and you get close to the expected overlap.
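For reference, the basic pattern is something like the sketch below; the kernel and buffer names are placeholders, and the host buffers have to be pinned (cudaMallocHost / cudaHostAlloc) for the async copies to actually overlap:

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real per-chunk work:
    __global__ void workKernel(const float4* in, float4* out, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count) out[i] = in[i];            // dummy body
    }

    // Pipelined upload / compute / download, with the problem split across a few streams.
    void run_pipelined(const float4* h_in, float4* h_out, int numPairs)
    {
        const int NSTREAMS = 4;
        const int chunk = (numPairs + NSTREAMS - 1) / NSTREAMS;

        float4 *d_in, *d_out;
        cudaMalloc(&d_in,  numPairs * sizeof(float4));
        cudaMalloc(&d_out, numPairs * sizeof(float4));

        cudaStream_t s[NSTREAMS];
        for (int k = 0; k < NSTREAMS; ++k) cudaStreamCreate(&s[k]);

        for (int k = 0; k < NSTREAMS; ++k) {
            int off = k * chunk;
            if (off >= numPairs) break;
            int cnt = (numPairs - off < chunk) ? (numPairs - off) : chunk;

            cudaMemcpyAsync(d_in + off, h_in + off, cnt * sizeof(float4),
                            cudaMemcpyHostToDevice, s[k]);
            workKernel<<<(cnt + 255) / 256, 256, 0, s[k]>>>(d_in + off, d_out + off, cnt);
            cudaMemcpyAsync(h_out + off, d_out + off, cnt * sizeof(float4),
                            cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                  // wait for all streams to finish

        for (int k = 0; k < NSTREAMS; ++k) cudaStreamDestroy(s[k]);
        cudaFree(d_in);
        cudaFree(d_out);
    }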

Did you use radix sort for stream compaction?