Hello everyone, I’ve been using CUDA for some time but I’m still far from familiar with it. Here’s my simple but annoying problem: in my program there’s a function that performs a reduction. It finds the max of N float numbers (N < 500) and returns its index. The interesting part is that these N numbers are not in one contiguous block, but are spread over M blocks scattered in a much larger array (like the central rectangle of a bigger square).

The original version copies those numbers back to the host and does all the sorting on the CPU, but the time consumed by the memcpyDeviceToHost transfer is unacceptably high.

Since a `__global__` kernel must return void, it takes an explicit memcpy operation to get the index variable back to the host, which is just what I’m trying to avoid. Also, typical reduction APIs that return an index, like cublasIsamax, are not a good fit for this problem, because I can’t afford to call them repeatedly. So, I really need some suggestions. Thanks to all :)
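For what it's worth, here is a minimal sketch of the kind of single-kernel approach that might help, assuming the N scattered values sit at known flat offsets (precomputed once on the host) inside the larger device array. All names here (`argmaxScattered`, `d_offsets`, `result`, `THREADS`) are illustrative, not an existing API:

```cuda
// Sketch: one-block argmax over N (< 500) values gathered from scattered
// offsets in a larger device array. The winning index is written to device
// memory, so only a 4-byte copy (or none at all) is needed afterwards.
#include <cfloat>
#include <cuda_runtime.h>

#define THREADS 512   // one block of 512 threads covers N < 500

__global__ void argmaxScattered(const float *data, const int *offsets,
                                int n, int *result)
{
    __shared__ float sVal[THREADS];
    __shared__ int   sIdx[THREADS];

    int tid = threadIdx.x;

    // Each thread gathers one candidate; unused slots are padded so they
    // can never win the reduction.
    sVal[tid] = (tid < n) ? data[offsets[tid]] : -FLT_MAX;
    sIdx[tid] = (tid < n) ? offsets[tid]       : -1;
    __syncthreads();

    // Standard shared-memory tree reduction, carrying the index along
    // with the value at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sVal[tid + s] > sVal[tid]) {
            sVal[tid] = sVal[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }

    // Index of the max element (into the big array) stays on the device.
    if (tid == 0)
        *result = sIdx[0];
}
```

Launched as `argmaxScattered<<<1, THREADS>>>(d_data, d_offsets, n, d_result);`. If the index is consumed by a later kernel, you can pass `d_result` straight to it and avoid any device-to-host copy; if the host genuinely needs the value, the transfer is a single 4-byte `cudaMemcpy`, which is far cheaper than copying all N floats back and reducing on the CPU.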