How to find the maximum of a giant float array in CUDA?


I’ve just run into a little problem.

I’ve got a giant array of float values, and I need to find the highest value in it. That’s probably no problem when using atomic functions, but how do I make it fast enough?

Is it efficient to do that in CUDA? (OK, I have to do it in CUDA anyway, so it doesn’t really matter whether it makes sense.)

My idea was the following:
- every block stores its values in shared memory (let’s say 3*512 float values)
- every block keeps track of the current maximum in shared memory
- every block writes its final maximum (of the block) to an indexed position in a global memory array
- the global memory array with the per-block maxima is handled by the CPU (no big deal to find the max of a few thousand values)

Is this plausible/smart? Or is there a better way to do that on the GPU?
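The steps above can be sketched roughly as follows. This is only a minimal sketch under my own assumptions, not a tuned implementation: the names block_max, gpu_max and kThreads are made up for illustration, the block size of 256 and the cap of 1024 blocks are arbitrary, the input is assumed to be non-empty, and error checking of the CUDA calls is left out.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cfloat>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces its share of the input to one maximum in shared memory
// and writes that maximum to block_out[blockIdx.x].
__global__ void block_max(const float* in, float* block_out, int n) {
    __shared__ float cache[kThreads];

    // Grid-stride loop: every thread first folds several elements into a register.
    float m = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        m = fmaxf(m, in[i]);
    }
    cache[threadIdx.x] = m;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            cache[threadIdx.x] = fmaxf(cache[threadIdx.x], cache[threadIdx.x + s]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) block_out[blockIdx.x] = cache[0];
}

// Host helper (hypothetical): runs the kernel, then finishes on the CPU.
float gpu_max(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());  // assumed > 0
    const int blocks = std::max(1, std::min(1024, (n + kThreads - 1) / kThreads));

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_max<<<blocks, kThreads>>>(d_in, d_out, n);

    // The per-block maxima are few, so the last step is done on the CPU.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return *std::max_element(partial.begin(), partial.end());
}
```

Calling gpu_max(values) on a std::vector<float> returns the largest element. No atomics are needed, because each block writes to its own slot of the output array.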

Use a parallel reduction. The SDK contains example code and a white paper which describes the various algorithmic options and their relative merits and performance, and there are a number of implementations available in ready-to-use libraries, at least in Thrust, and maybe in CUDPP as well.
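For the library route, a max reduction with Thrust might look something like this. It is a minimal sketch, assuming the data starts out in a std::vector<float> on the host; the wrapper name thrust_max is made up for illustration, while thrust::device_vector, thrust::reduce and thrust::maximum are the actual Thrust names.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cfloat>
#include <vector>

// Hypothetical wrapper: copy the data to the GPU and reduce it with "max".
float thrust_max(const std::vector<float>& host) {
    // Constructing a device_vector from a host range copies the data over.
    thrust::device_vector<float> d(host.begin(), host.end());

    // -FLT_MAX is the starting value, so it is also what an empty input returns.
    return thrust::reduce(d.begin(), d.end(), -FLT_MAX, thrust::maximum<float>());
}
```

If the data already lives in device memory, the copy can be skipped by reducing over the existing device range instead.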

You’re lucky you only want to find the one global maximum.

I was having the problem of finding the N largest values per group out of K groups of floats. The larger N is, the more this becomes a sorting problem.

Thanks :)

My strategy was already something like a parallel reduction, but it’s good to know where it’s written down correctly (meaning: I found it!) :D

The first code example is EXACTLY what I wanted to do (except for the result)… that saves time ^^

Where can I find the libraries you mentioned? Or is it maybe smarter, and more educational, to try it on my own?

It’s actually like this: I get a bunch of float3 values, and I have to find the minimum and maximum of each coordinate. But I think that’s solvable… early in the morning -.- Have a nice day! :P

Thrust and CUDPP.

As for using float3 values, the idea should be the same as a scalar reduction, except that the reduction code needs to be done for each scalar member of each float3 read. The result would wind up being held in a float3 output variable. The only thing to keep in mind is that float3 performance isn’t as good as either float2 or float4, so you might want to look at what the memory bandwidth implications of the type are for your application. It might be negligible.
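To make the per-member idea concrete, here is one way it could be written with Thrust. This is a sketch under my own assumptions: Bounds, ToBounds, MergeBounds and bounding_box are hypothetical names, and I carry the minimum and maximum together in one struct so that a single pass produces both.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical result type: per-coordinate minimum (lo) and maximum (hi).
struct Bounds {
    float3 lo, hi;
};

// Turns a single point into a degenerate box containing only that point.
struct ToBounds {
    __host__ __device__ Bounds operator()(const float3& p) const {
        Bounds b;
        b.lo = p;
        b.hi = p;
        return b;
    }
};

// Merges two boxes member by member: the scalar reduction applied to x, y and z.
struct MergeBounds {
    __host__ __device__ Bounds operator()(const Bounds& a, const Bounds& b) const {
        Bounds r;
        r.lo = make_float3(fminf(a.lo.x, b.lo.x), fminf(a.lo.y, b.lo.y), fminf(a.lo.z, b.lo.z));
        r.hi = make_float3(fmaxf(a.hi.x, b.hi.x), fmaxf(a.hi.y, b.hi.y), fmaxf(a.hi.z, b.hi.z));
        return r;
    }
};

// Hypothetical wrapper: min and max of each coordinate in one reduction.
Bounds bounding_box(const std::vector<float3>& host_points) {
    thrust::device_vector<float3> d(host_points.begin(), host_points.end());

    // Start from an "empty" box so that any real point replaces the initial values.
    Bounds init;
    init.lo = make_float3(FLT_MAX, FLT_MAX, FLT_MAX);
    init.hi = make_float3(-FLT_MAX, -FLT_MAX, -FLT_MAX);

    return thrust::transform_reduce(d.begin(), d.end(), ToBounds(), init, MergeBounds());
}
```

The same merge step would also work inside a hand-written reduction kernel; only the element type and the combine operation change compared with the scalar case.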

Thank you, but I think I’ll have to stay with float3.

I’m just confused (too much coffee, too little beer)… can I cast a float pointer to a float3 pointer? Or are the values in a float3 stored in some other way than just 3 floats in a row?

I wouldn’t try casting. I have run into some pretty bad problems where structs are aligned to 64-bit boundaries by default in nvcc. So float3 might actually be laid out like float float float blank float float float … etc… depending on the target.
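Rather than guessing at the layout, it can be checked at compile time. A minimal sketch, assuming the built-in float3 from cuda_runtime.h: as far as I know that type is three packed floats with plain float alignment (unlike float2 and float4, which carry stricter alignment), but the assertions below make the compiler confirm it for your toolchain. The helper name to_float3 is made up for illustration, and it copies with memcpy instead of casting the pointer.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>
#include <vector>

// Compilation fails here if float3 is not laid out as x, y, z with no padding.
static_assert(sizeof(float3) == 3 * sizeof(float), "float3 has padding");
static_assert(offsetof(float3, y) == sizeof(float), "unexpected offset of y");
static_assert(offsetof(float3, z) == 2 * sizeof(float), "unexpected offset of z");

// Hypothetical helper: reinterpret a flat x,y,z,x,y,z,... buffer as float3 values.
// Trailing floats that do not form a complete triple are ignored.
std::vector<float3> to_float3(const std::vector<float>& flat) {
    std::vector<float3> out(flat.size() / 3);
    if (!out.empty()) {
        std::memcpy(out.data(), flat.data(), out.size() * sizeof(float3));
    }
    return out;
}
```

If the assertions hold, a flat float buffer and a float3 buffer have the same byte layout, so the same copy can be done with cudaMemcpy when the data goes to the device.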