optimization for atomic operation

I would like to find a maximum and a minimum number of a sequence of numbers, working space is 10241024 (each thread handle one number,10241024 threads in total). I use atomic operation, block size if 16*16. it takes about 10 ms (GTX 570)to complete, which is not tolerable.
I wonder if there is another solution or if there is an optimization to do it.

PS: DownSampling or not using atomic operation would render the result bad.

You should not need atomic operations for finding minimum/maximum of an array of numbers - this is reduction operation, and probably the easiest way (it should be one line of code) to get the job done is just to use Thrust.

I appreciate your answer. However,it is that each thread has its own group id which has no direct relation with its thread id and for each group I would like to know its threads’ minimum/maximum number. In other words, 1024*1024 threads are grouped into several groups, I would like to know each group’s minimum/maximum number.

There are changeable size of threads in each group.

So this is more like a histogram then, not a reduction. There is some material about histograms in the SDk as well.
How many different group ids are there?

Then check thrust::reduce_by_key().

The group ids’s number are more than 5 thousand. I have scanned the histograms code in SDK briefly, which uses shared memory to accelerate memory access. But in my situation, shared memory might not big enough for every group.

Thanks for you advice. Do you mean to override binary op to justify it ? I might have troubles in doing it. Can you give me some guidance.

My code is like this:

texture<float4, 2, cudaReadModeElementType> Tex; //1024*1024 texture

kernel(parameter p ,vector vec)


 int x = thread.x;

 int y = thread.y;

value groupId = Tex(x,y);

 value result = compute(x,y,p);




You would start with your “groupID” stored in GPU memory, probably in an ordinary array and not texture. Then you would allocate another array for your “result”-s, and calculate these (eventually through running a kernel to do that). Then you would run thrust::reduce_by_key() twice (once to calculate minimum, and then to calculate maximum), using your “groupID”-s as keys, and your “result”-s as values data, and specifying thrust::minimum<>() and thrust::maximum() as binary operator, respectively.

Note that using the Thrust-solution requires sorting the data wrt. group-ids, as reduce-by-key reduces only consecutive sets of keys.

I tried this problem with my histogram-algorithm (https://github.com/trantalaiho/Cuda-Histogram) with 5000 bins and randomly distributed keys and got it done in 4.5ms with Tesla M2070 (ECC on), which should be a tad slower than your Geforce 570.

At about 2x faster it’s still not very fast (only 234 MKeys/s - I haven’t optimized much the large-histogram cases) and thrust-sort followed by reduce_by_key might be faster with such a large number of bins. Maybe sorting wrt. group-index and value simultaneously and just finding the results at the borders of the result-array could also be a competitive solution.

I’m attaching the sample code in case you are interested in trying it out (just compile with nvcc -O4 -arch=sm_20 test_find_minmax_group.cu and run with ./a.out --rnd --print)
cuda_histogram.h (115 KB)
test_find_minmax_group.cu (12.7 KB)