Find the largest value among all threads' results: on the CPU or the GPU?

I need to find the largest value among the results of all threads. Suppose each thread produces one float value as its result. What is the best way to find out which one is the largest? Should I transfer all the data to the CPU and let the CPU handle this? I noticed that copying data from device memory back to the CPU is very time consuming.

Thank you,

A follow-up question: if I do find the largest value among all threads and want to return it to the CPU, do I still have to allocate space for one float in device memory and copy it from device memory to host memory? Is there a faster way?

I think it depends on what you want to do with it. If you never need the value on the CPU, you don’t have to transfer it back to the host. As for your first question, I think it is very hard to find the largest value inside the kernel, because then the threads need to depend on each other. And as I see it, that is something you don’t want to have, right?


Just take a look at the reduction sample in the SDK, please.

But a reduction does not give the largest value of the array, it gives an accumulated value. Or am I wrong?

Just change the reduction example to do max() instead of +, think about what the example is doing and you will see that it works.
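To make that suggestion concrete, here is a minimal sketch of a max reduction. It is not the SDK sample itself: the names `blockMax` and `gpuMax` are made up for illustration, the block size of 256 is an assumption (it must be a power of two for this loop), and the inputs are assumed to be finite floats that already live in device memory. Each block reduces its slice in shared memory with fmaxf() where a sum reduction would use +, the per-block results are reduced again until one value is left, and only that single float is copied to the host.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

#define BLOCK 256  // assumed block size; must be a power of two for the loop below

// Each block writes the maximum of its slice of `in` to out[blockIdx.x].
__global__ void blockMax(const float* in, float* out, int n)
{
    __shared__ float s[BLOCK];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    // Threads past the end contribute -FLT_MAX, which never wins against a finite input.
    s[threadIdx.x] = (i < n) ? in[i] : -FLT_MAX;
    __syncthreads();

    // Tree reduction: halve the number of active threads in every pass.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();  // finish this pass before the next one reads its results
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host helper: maximum of n >= 1 floats stored at device pointer d_in.
float gpuMax(const float* d_in, int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *d_a, *d_b;
    cudaMalloc(&d_a, blocks * sizeof(float));
    cudaMalloc(&d_b, ((blocks + BLOCK - 1) / BLOCK) * sizeof(float));

    blockMax<<<blocks, BLOCK>>>(d_in, d_a, n);
    while (blocks > 1) {                      // reduce the per-block maxima again
        int next = (blocks + BLOCK - 1) / BLOCK;
        blockMax<<<next, BLOCK>>>(d_a, d_b, blocks);
        float* t = d_a; d_a = d_b; d_b = t;   // output becomes the next input
        blocks = next;
    }

    float result;
    cudaMemcpy(&result, d_a, sizeof(float), cudaMemcpyDeviceToHost);  // only 4 bytes
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

Calling `gpuMax(d_values, n)` on a device array returns its maximum on the host. The large array never leaves the GPU, so the expensive device-to-host copy from the original question shrinks to a single float.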

The reduction approach works for any associative operator. Also, check out the CUDA Data Parallel Primitives (CUDPP) library; there may already be a function that does what you need.
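As a rough illustration of the "any associative operator" point, the sketch below passes the operator to a single-block reduction as a template parameter. Everything named here (`MaxOp`, `SumOp`, `reduceBlock`, `reduceOnGpu`) is hypothetical and not part of the SDK or CUDPP. It assumes at most 256 input values and a power-of-two block size, and the caller supplies an identity element used to pad unused threads.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Hypothetical functors; any associative operator can be swapped in the same way.
struct MaxOp { __device__ float operator()(float a, float b) const { return fmaxf(a, b); } };
struct SumOp { __device__ float operator()(float a, float b) const { return a + b; } };

// Single-block tree reduction over n <= blockDim.x values.
// Assumes blockDim.x is a power of two and blockDim.x floats of dynamic shared memory.
template <typename Op>
__global__ void reduceBlock(const float* in, float* out, int n, Op op, float identity)
{
    extern __shared__ float s[];
    s[threadIdx.x] = (threadIdx.x < n) ? in[threadIdx.x] : identity;
    __syncthreads();

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = op(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();  // every pass must finish before the next one starts
    }
    if (threadIdx.x == 0) *out = s[0];
}

// Hypothetical host wrapper: copies n <= 256 floats to the device and reduces them.
template <typename Op>
float reduceOnGpu(const float* h_in, int n, Op op, float identity)
{
    const int threads = 256;
    float *d_in, *d_out, result;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    reduceBlock<<<1, threads, threads * sizeof(float)>>>(d_in, d_out, n, op, identity);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For example, `reduceOnGpu(h, n, MaxOp(), -FLT_MAX)` yields the maximum and `reduceOnGpu(h, n, SumOp(), 0.0f)` yields the sum, with the same kernel in both cases. One caveat: floating-point addition is only approximately associative, so a tree-ordered sum can differ slightly from a sequential one, whereas max gives the same answer in any order.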


Thanks to all. Here is what I did:

int p1 = THREADS;  // number of candidates still in play

while (p1 > 1)
{
    int half = (p1 + 1) / 2;  // half of p1, rounded up so an odd count loses no element

    // push the bigger element to the first half
    if ((threadIdx.x + half < p1) && (s_solutions[threadIdx.x] < s_solutions[threadIdx.x + half]))
    {
        s_solutions[threadIdx.x] = s_solutions[threadIdx.x + half];
    }

    __syncthreads();  // all writes of this pass must finish before the next pass reads them
    p1 = half;
}

As a result, s_solutions[0] will be the largest value.

Unfortunately, CUDPP currently only has max-scan, not max-reduce. In fact, it doesn’t have any reductions yet; they are on our to-do list!