Reduction algorithm and matrices: how to apply the SDK reduction algorithm to a half-matrix

I apologize for repeating a question that I posted in the “Programming and Development” forum, but since I didn’t get any response there, I wonder if this is a better place to post it.

I have a relatively large matrix (7000x7000 floats) with only the upper half occupied. I need to find the minimum element and its row index for each column, and then the minimum of all the column minimums (I need to know the index as well as the value). I’ve looked at the SDK scan and reduction routines, but I’m not experienced enough to know how to apply them efficiently.

My initial approach is to create 7000 blocks (one for each column) and find the minimum of each. Then a second kernel finds the minimum of those results. To avoid idle threads in the lower-half blocks (where the columns are mostly empty), I thought of letting those threads “help” the upper blocks by working on some of their elements, and then merging their results with the results of the local threads. But it’s messy and complicated, and I’m not sure it would be efficient.
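To make the two-kernel idea concrete, here is a minimal sketch of the first kernel: one block per column, each block reducing its column to a single (value, row index) pair. All names are illustrative, and it assumes the matrix is stored row-major as a full n x n array with only rows 0..c valid in column c; the block size must be a power of two for the tree reduction.

```cuda
#include <cfloat>

// One block per column. Each block writes the minimum value of its
// column and the row index where it occurs.
__global__ void columnMin(const float *mat, int n,
                          float *colMinVal, int *colMinIdx)
{
    extern __shared__ float sval[];          // shared values
    int *sidx = (int *)&sval[blockDim.x];    // shared row indices

    int col = blockIdx.x;
    int tid = threadIdx.x;

    // Stride over the valid rows of this column (rows 0..col), so a
    // fixed block size handles columns of any length, power of two or not.
    float best = FLT_MAX;
    int bestRow = -1;
    for (int row = tid; row <= col; row += blockDim.x) {
        float v = mat[row * n + col];
        if (v < best) { best = v; bestRow = row; }
    }
    sval[tid] = best;
    sidx[tid] = bestRow;
    __syncthreads();

    // Standard shared-memory tree reduction, carrying the index along.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] < sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {
        colMinVal[col] = sval[0];
        colMinIdx[col] = sidx[0];
    }
}
```

A hypothetical launch would look like `columnMin<<<7000, 256, 256 * (sizeof(float) + sizeof(int))>>>(mat, 7000, colMinVal, colMinIdx);`. The second pass over the 7000 per-column results is small enough to run in a single block with the same reduction pattern, or even on the CPU.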

Can anyone please give me some guidance on how to approach this? For my data size, what grid and block dimensions should I use? How should I organize the data efficiently, and how should I handle the size not being a power of two? I have searched the forum for related discussions and have read many of the posts, but still haven’t found anything that answers my questions. Thanks very much for your help.



Often, the best way to optimize your kernels is just a bit of trial and error, plus the visual profiler. I believe the nbody simulation example in the SDK achieved very good results over a triangular matrix (it’s symmetric, so they only needed the upper triangular part); perhaps you can take a look at its code and get some ideas from it.

As far as the power-of-two situation goes, there are a few options. You could pad the matrix to the next power of two (8192); or you could add an if statement in the kernel that checks the thread index against an argument giving the size of the matrix, and simply does nothing if it’s past the ‘end’ of the matrix (I think the SDK reduction does this, perhaps in one of the earlier steps?). A third option, if your kernel can handle chunks of the matrix (say, 80 columns to a block): you could always do the first ‘partial’ chunk on the CPU (since it will only have a very small number of elements anyway), and then the rest on the GPU.
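The in-kernel bounds check is the simplest of those options. A sketch, for a min reduction over a flat array (the size parameter `n` and kernel name are illustrative): out-of-range threads load the neutral element instead of reading past the end, so only the block size needs to be a power of two, not the data size.

```cuda
#include <cfloat>

// Block-level min reduction over an array of arbitrary length n.
// Each block writes one partial minimum to out[blockIdx.x].
__global__ void minReduce(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end contribute FLT_MAX, the identity for min,
    // so they never read out of bounds and never affect the result.
    sdata[tid] = (i < n) ? in[i] : FLT_MAX;
    __syncthreads();

    // Tree reduction; requires blockDim.x to be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fminf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

The same guard generalizes to any reduction by swapping in that operation’s identity element (0 for sum, -FLT_MAX for max, and so on).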

I have some more to write, but I’m pretty tired. I’ll finish my thoughts in the morning…

Thanks very much for your input; I’m looking forward to the rest of your thoughts when you have some time and energy!

I will look at the nbody simulation. Doing some of the work on the CPU is a good idea, though it’s a bit complicated for my application, because each element of the matrix interacts with the rest of the matrix. I can’t easily isolate the lower part of the matrix from the rest (not without a complex algorithm that I’m not up to).

I’m basically implementing a hierarchical clustering algorithm: I have a one-dimensional array of elements (each element is an array of different variables), and I have to create a half-matrix that represents the “distance” between every pair of elements. If you have any more general tips for this project, I’d appreciate them.
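For the distance half-matrix, one common trick is to store only the strict upper triangle packed into a 1D array of n*(n-1)/2 entries, with one thread per pair. This sketch assumes Euclidean distance and illustrative names throughout; the linear-scan unpacking is simple but O(n) per thread, and a closed-form version via the quadratic formula exists if it ever shows up in profiles.

```cuda
#include <math.h>

// Map a flat index k in [0, n*(n-1)/2) back to a pair (i, j), i < j,
// of the packed upper triangle (row-major, diagonal excluded).
__device__ void unpackPair(long long k, int n, int *i, int *j)
{
    int row = 0;
    long long rowStart = 0;
    // Row i of the packed triangle holds n - 1 - i entries.
    while (rowStart + (n - 1 - row) <= k) {
        rowStart += n - 1 - row;
        ++row;
    }
    *i = row;
    *j = (int)(row + 1 + (k - rowStart));
}

// elems: n elements of `dims` floats each, row-major.
// dist: packed upper-triangle output, n*(n-1)/2 floats.
__global__ void pairwiseDist(const float *elems, int n, int dims,
                             float *dist)
{
    long long total = (long long)n * (n - 1) / 2;
    long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= total) return;   // bounds guard, as discussed above

    int i, j;
    unpackPair(k, n, &i, &j);

    // Euclidean distance between element i and element j.
    float sum = 0.0f;
    for (int d = 0; d < dims; ++d) {
        float diff = elems[i * dims + d] - elems[j * dims + d];
        sum += diff * diff;
    }
    dist[k] = sqrtf(sum);
}
```

Packing also halves the memory footprint versus a full 7000x7000 array, which matters at this size (about 98 MB packed versus 196 MB full, in single precision).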