Reduce question: small reduce to avoid memcpyDeviceToHost

Hello everyone, I’ve been using CUDA for some time but I’m still far from familiar with it. Here’s my simple but annoying problem: my program has a function that does a reduce job. It finds the max of N float numbers (N < 500) and returns its index. The interesting part is that these N numbers are not in one contiguous block, but are scattered across M blocks inside a much larger array (like the central rectangle of a bigger square).

The original version copies those numbers back to the host and does all the sorting on the CPU, but the time consumed by the memcpyDeviceToHost is unacceptably high.

Since a __global__ kernel must return void, an explicit memcpy is needed to get the index variable back to the host, which is exactly what I’m trying to avoid. Also, typical reduction APIs that return the index, like cublasIsamax, aren’t a good fit for this problem, because I can’t afford to call them repeatedly. So I really need some suggestions. Thanks to all :)
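
To make it concrete, even when I move the search into a kernel I still end up with a structure like this; the single-thread kernel here is just a simplified stand-in for my real one:

    #include <cuda_runtime.h>

    // Simplified stand-in for my real kernel: one thread scans the scattered
    // values through the index list and writes the argmax into device memory.
    __global__ void find_max_index(const float* data, const int* indices,
                                   int n, int* result)
    {
        int best = indices[0];
        for (int i = 1; i < n; ++i)
            if (data[indices[i]] > data[best])
                best = indices[i];
        *result = best;
    }

    int get_index(const float* d_data, const int* d_indices, int n)
    {
        int *d_result, h_result;
        cudaMalloc((void**)&d_result, sizeof(int));

        find_max_index<<<1, 1>>>(d_data, d_indices, n, d_result);

        cudaMemcpy(&h_result, d_result, sizeof(int),
                   cudaMemcpyDeviceToHost);   // the explicit copy I'm trying to avoid
        cudaFree(d_result);
        return h_result;
    }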

What about doing memcpy asynchronously?

I am not aware of a possibility of returning a value from a kernel.

I am also doing a scan and then returning the single value.

My memory is consecutive, so there is only one memcpy at the end.

Because it is a single small memcpy for the whole data, it costs essentially no time.

Another idea: what about doing the max first, and then calling a function which gathers all the results and does the memcpy in one go?
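
Something along these lines, where each iteration drops its result into its own slot of a device-side array and one memcpy at the end fetches them all (the names are made up, and find_max_index stands in for whatever kernel you already have):

    #include <cuda_runtime.h>
    #include <vector>

    // Placeholder for whatever kernel you already use; it just has to write its
    // result into the int it is pointed at.
    __global__ void find_max_index(const float* data, const int* indices,
                                   int n, int* result);

    // Sketch: each outer iteration writes its argmax into its own slot of a
    // device-side array; a single memcpy at the very end brings everything back.
    std::vector<int> run_all_iterations(const float* d_data, const int* d_indices,
                                        int n, int num_iters)
    {
        int* d_results;
        cudaMalloc((void**)&d_results, num_iters * sizeof(int));

        for (int i = 0; i < num_iters; ++i)
            find_max_index<<<1, 256>>>(d_data, d_indices, n, d_results + i);

        std::vector<int> h_results(num_iters);
        cudaMemcpy(h_results.data(), d_results, num_iters * sizeof(int),
                   cudaMemcpyDeviceToHost);   // one copy for all iterations
        cudaFree(d_results);
        return h_results;
    }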

Unfortunately for my app, the outermost loop repeats hundreds of times, and this device-to-host memcpy is the only “significant” copy: it accounts for roughly 20% of my entire execution time.

So I’m basically looking for a scan operation with a return value, so that I can get rid of a separate copy afterwards.

But you’re right, I’m going to rearrange the code and try cudaMemcpyAsync to see if it makes a difference. Thanks for the reminder :)
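
For the record, the variant I plan to try looks roughly like this; find_max_index stands in for my real kernel, and the pinned host buffer is there because the copy is only truly asynchronous from page-locked memory:

    #include <cuda_runtime.h>

    // Placeholder declaration for my kernel from the earlier sketch; it writes
    // the index into d_result on the device.
    __global__ void find_max_index(const float* data, const int* indices,
                                   int n, int* result);

    void get_index_async(const float* d_data, const int* d_indices,
                         int n, int* d_result)
    {
        // The host buffer must be pinned for the copy to be truly asynchronous.
        int* h_result;
        cudaHostAlloc((void**)&h_result, sizeof(int), cudaHostAllocDefault);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        find_max_index<<<1, 1, 0, stream>>>(d_data, d_indices, n, d_result);
        cudaMemcpyAsync(h_result, d_result, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);

        // ... queue independent work here so it overlaps with the copy ...

        cudaStreamSynchronize(stream);   // wait only when the index is actually needed
        // use *h_result ...

        cudaStreamDestroy(stream);
        cudaFreeHost(h_result);
    }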

Maybe you can perform a scan first to get the indices you need to perform your reduction on?
A reduction of fewer than 500 values can then be done in a single block. If you take 256 threads, each thread takes the maximum of input[input_index[threadIdx.x]] and input[input_index[threadIdx.x + 256]] (only when threadIdx.x + 256 < N); otherwise it just takes input[input_index[threadIdx.x]] as its max.
Then perform the standard reduction in shared memory.

It will use just one block, but it should be about the fastest you can get.
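
Here is a rough sketch of what I mean, carrying the index along in the reduction since that is what you ultimately need (the kernel and parameter names are placeholders; the value written out is the position in the big array):

    #include <cfloat>
    #include <cuda_runtime.h>

    // Sketch of the single-block idea above, launched as <<<1, 256>>>.
    // input       : the big array
    // input_index : positions of the N scattered candidates (from the scan)
    // result_index: receives the position (in the big array) of the maximum
    __global__ void argmax_one_block(const float* input, const int* input_index,
                                     int N, int* result_index)
    {
        __shared__ float s_val[256];
        __shared__ int   s_idx[256];

        int tid = threadIdx.x;

        // Load phase: each thread considers candidates tid and tid + 256 (if they exist).
        float best_val = -FLT_MAX;
        int   best_idx = -1;
        if (tid < N) {
            best_idx = input_index[tid];
            best_val = input[best_idx];
        }
        if (tid + 256 < N) {
            int   idx = input_index[tid + 256];
            float val = input[idx];
            if (val > best_val) { best_val = val; best_idx = idx; }
        }
        s_val[tid] = best_val;
        s_idx[tid] = best_idx;
        __syncthreads();

        // Standard tree reduction in shared memory, carrying the index along.
        for (int stride = 128; stride > 0; stride >>= 1) {
            if (tid < stride && s_val[tid + stride] > s_val[tid]) {
                s_val[tid] = s_val[tid + stride];
                s_idx[tid] = s_idx[tid + stride];
            }
            __syncthreads();
        }

        if (tid == 0)
            *result_index = s_idx[0];
    }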

Don’t perform the reduction on the GPU. You have to copy the final value back to the host anyway, and copying 500 values instead of one makes hardly any difference in time. Compare:

  • 1 kernel call + 1 memcpy of one float
    vs.
  • 1 memcpy of 500 floats + max search on the CPU (VERY fast)

As you can see from the reduction example and the cublas max/min kernels, it’s best not to issue a second kernel call (because of the launch overhead) but instead to transfer a few hundred floats to the host and perform the final reduction on the CPU…
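
Roughly like this, assuming an earlier kernel has already gathered the N candidate values into a contiguous device buffer (the buffer and function names are just placeholders):

    #include <cuda_runtime.h>
    #include <vector>

    // Sketch of the CPU-side finish: one small device-to-host copy of the
    // gathered candidates, then a trivial max search on the host.
    int final_argmax_on_host(const float* d_candidates, int n)
    {
        std::vector<float> h(n);
        cudaMemcpy(h.data(), d_candidates, n * sizeof(float),
                   cudaMemcpyDeviceToHost);   // a few hundred floats: cheap

        int best = 0;
        for (int i = 1; i < n; ++i)
            if (h[i] > h[best])
                best = i;                     // very fast for n < 500
        return best;
    }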