I need to find the largest value among the results of all threads. Suppose each thread produces a float as its result; what is the best way to find out which one is the largest? Should I transfer all the data to the CPU and let the CPU handle it? I have noticed that copying data from device memory back to the CPU is very time consuming.
A follow-up question: if I do find the largest value among all threads and want to return it to the CPU, do I still have to allocate space for one float in device memory and copy it from device memory to host memory? Is there a faster way?
I think it depends on what you want to do with it. If you never use it on the CPU, you don’t have to transfer it back to the host. As for your first question, I think it is very hard to find the largest value inside the kernel, because then the threads need to depend on each other. And as I see it, that is something you don’t want, right?
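// s_solutions[] must already hold one value per thread, followed by a __syncthreads()
int remaining = THREADS; // number of candidates still in play
while (remaining > 1)
{
    int p1 = (remaining + 1) / 2; // middle index: half of remaining, rounded up so odd counts lose no element
    if( (threadIdx.x + p1 < remaining) && (s_solutions[threadIdx.x] < s_solutions[threadIdx.x+p1]) )
    {
        // push the bigger element to the first half
        s_solutions[threadIdx.x] = s_solutions[threadIdx.x+p1];
    }
    __syncthreads(); // every write of this pass must be visible before the next pass reads it
    remaining = p1;
}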
As a result, s_solutions[0] will be the largest value.
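For completeness, here is a rough sketch of how that loop could sit inside a whole kernel, with only the single result copied back to the host. Everything around the loop is my assumption and not from the posts above: the kernel name block_max, the host wrapper max_on_gpu, THREADS = 256, and a single-block launch. The sketch rounds the midpoint up and adds __syncthreads() after each pass so that odd counts and cross-warp reads are handled. The wrapper uploads the input only to keep the example self-contained; in the situation described in the question the values would already be in device memory.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

#define THREADS 256  // assumed block size, not taken from the thread

// One block of THREADS threads reduces n inputs to their maximum.
__global__ void block_max(const float* in, int n, float* out)
{
    __shared__ float s_solutions[THREADS];

    // Each thread folds a strided slice of the input into one candidate.
    float best = -FLT_MAX;
    for (int i = threadIdx.x; i < n; i += THREADS)
        best = fmaxf(best, in[i]);
    s_solutions[threadIdx.x] = best;
    __syncthreads();  // all candidates must be stored before the reduction reads them

    int remaining = THREADS;
    while (remaining > 1)
    {
        int p1 = (remaining + 1) / 2;  // half, rounded up
        if (threadIdx.x + p1 < remaining &&
            s_solutions[threadIdx.x] < s_solutions[threadIdx.x + p1])
        {
            // push the bigger element to the first half
            s_solutions[threadIdx.x] = s_solutions[threadIdx.x + p1];
        }
        __syncthreads();  // finish this pass before the next one starts
        remaining = p1;
    }

    if (threadIdx.x == 0)
        *out = s_solutions[0];  // one thread publishes the block's maximum
}

// Hypothetical host wrapper; error checking is omitted for brevity.
float max_on_gpu(const float* h_in, int n)
{
    float* d_in = nullptr;
    float* d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_max<<<1, THREADS>>>(d_in, n, d_out);

    // Only one float crosses the bus; this blocking copy also waits for the kernel.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling max_on_gpu(values, count) from host code returns the maximum of the array. This only covers one block. With several blocks, one option would be for each block to write its own maximum to an array and then run the same kernel once more over that array.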