Spreading loop iterations across threads, but serializing max operations

Hi there

In my threads I have a for loop like this:

// Stuff here

float highestScore = 0;
int highestIndex = -1;
for (int a = 0; a < num; ++a) {
    float score = compScore(a); // costly function

    // Max operation
    if (score > highestScore) {
        highestScore = score;
        highestIndex = a;
    }
}

// Stuff here

Now, the number num can vary between the 32 threads in a block, so some threads might be left waiting for others to finish.

What I’d like is a method to spread the evaluations of compScore equally across all the threads (so one thread might compute a score for another thread), but then to serialize the max operation at each iteration, so that if two threads are working on a single thread’s score, only the higher score updates highestScore.

I hope that’s clear, but if not, please let me know.

Do people know what methods are best for tackling this problem?

Thanks.

Hi,
are you limited to 32 threads in a block, or will your kernel still be happy with several warps per block?

If you can have, for example, 192 threads per block, then can you arrange it (by sorting on ‘num’) so that:
warp 0: the 32 cases with the highest num
warp 1: the next highest 32
…
warp 5: the 32 cases with the lowest values of num

In this way all the threads in a warp would take about the same time to complete,
and warps whose threads have all completed will have already exited the “for (int a” loop and will just wait for the other warps to reach the same point before they continue. While they are waiting, the warps that are still doing the for loop will get more of the thread-processor time, so will finish faster.
This avoids a heap of writing/reading to shared memory, __syncthreads() calls, and extra code.
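For illustration, a minimal host-side C++ sketch of that grouping (the helper name warpOrderByNum is hypothetical, not from the thread): sort the thread indices by num, descending, and hand them to warps 32 at a time.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper: given each thread's loop count, return thread indices
// sorted so that order[0..31] are the 32 highest-num cases (-> warp 0),
// order[32..63] the next 32 (-> warp 1), ..., down to the lowest-num warp.
std::vector<int> warpOrderByNum(const std::vector<int>& num) {
    std::vector<int> order(num.size());
    std::iota(order.begin(), order.end(), 0);       // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return num[a] > num[b]; });
    return order;
}
```

On the device this ordering would have to be produced with a block-level sort, but the resulting warp assignment is the same idea.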

Many thanks. No, I’m not limited to 32 threads per block, so I could have more, as you suggest. However, the value num is computed in the kernel, above the for loop. That value changes from thread to thread and iteration to iteration; it can’t be precomputed. This means the sorting would need to be done in the kernel. I worry this sorting would mess up my nicely coalesced memory accesses, by effectively randomizing the order in which global memory is accessed. Any thoughts?

If I were to go the shared memory route (the approach I had in mind), can anyone suggest how I would go about resolving the conflicts in setting the highest score?

Thanks.

These variables score and highestScore are in registers anyway. If compScore is not the only function in the kernel, I doubt anything can be done.

To clarify the problem: I assume [font=“Courier New”]compScore()[/font] has an implicit dependence on [font=“Courier New”]threadIdx[/font]? And will [font=“Courier New”]highestScore[/font] and [font=“Courier New”]highestIndex[/font] be subject to a reduction over all threads?

Yes.

Yes, roughly. In fact the reduction will generally happen over a subset of threads, and there will be several reductions running in parallel. For example, if there are 32 threads and threads 1 and 2 have the largest num, at the end half the threads will be computing the score for thread 1 and half for thread 2.
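To pin down what result those parallel reductions should produce, here is a sequential host-side C++ illustration (names hypothetical): several argmax reductions over disjoint subsets, keyed by the thread whose score is being computed.

```cpp
#include <map>
#include <utility>
#include <vector>

// Hypothetical illustration: seg[i] says which thread's score entry i
// belongs to; for each segment, find the highest score and its index.
// This is the result the per-subset device reductions would compute.
std::map<int, std::pair<float, int>> segmentedArgMax(
        const std::vector<int>& seg,
        const std::vector<float>& score) {
    std::map<int, std::pair<float, int>> best;  // segment -> (score, index)
    for (std::size_t i = 0; i < seg.size(); ++i) {
        auto it = best.find(seg[i]);
        if (it == best.end() || score[i] > it->second.first)
            best[seg[i]] = { score[i], static_cast<int>(i) };
    }
    return best;
}
```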

Hi all

“reduction” was the hint I needed - thanks tera. It led me to the whitepaper here:
[url=“http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf”]http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf[/url]
which has given me enough ideas to work something out.

Thanks for all the help.
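For anyone landing here later, a host-side C++ sketch of the whitepaper's tree-style max reduction, extended to carry the winning index along with the max value. It simulates the shared-memory kernel sequentially (the inner loop over tid plays the role of the threads; on the device a __syncthreads() separates the strides) and assumes a power-of-two input size.

```cpp
#include <utility>
#include <vector>

// Sequential simulation of a shared-memory tree reduction: at each stride,
// "thread" tid compares element tid with element tid+stride and keeps the
// larger one, propagating its original index alongside the value.
std::pair<float, int> maxReduce(std::vector<float> vals) {
    std::vector<int> idxs(vals.size());
    for (std::size_t i = 0; i < vals.size(); ++i) idxs[i] = static_cast<int>(i);
    for (std::size_t stride = vals.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            if (vals[tid + stride] > vals[tid]) {
                vals[tid] = vals[tid + stride];
                idxs[tid] = idxs[tid + stride];
            }
        }
        // on the device: __syncthreads() here
    }
    return { vals[0], idxs[0] };  // (highestScore, highestIndex)
}
```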

You could still do coalesced transfers from global to shared; I don’t think uncoalesced reads from shared memory are an issue.

So:

// calc num
// sort on num
int somethingId = sortedNum[threadId];
for (int a = 0; a < num; ++a) {
    // coalesced transfer from global to shared, i.e.
    sharArray[threadId] = globArray[offset + threadId];
    __syncthreads();
    float localVar = sharArray[somethingId];
    // do the costly function on localVar
    __syncthreads(); // before the next iteration overwrites sharArray
}