Inconsistent results for reduction, except when using printf or cuda-memcheck

Really? As a beginner it's really complicated to know whether our code is optimized or not; the information isn't all stored in the same place… What are these requirements and compilation options? Because if it's sequential, I could have just done a sequential for loop and summed?…

Thanks.

You may wish to re-read this question that I already linked:

sorting - How to use Thrust to sort the rows of a matrix? - Stack Overflow

batched reduction:
http://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html
https://nvlabs.github.io/moderngpu/segreduce.html

Yeah, I was surprised to see you calling your reduction from inside a kernel. I assumed you wanted to call thrust::reduce from the CPU context where it’ll launch a properly configured kernel to give you the reduction you desire.
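For what it's worth, here is a minimal sketch of that intended usage (the vector size and fill value are just illustrative):

```cpp
#include <thrust/reduce.h>
#include <thrust/device_vector.h>

int main()
{
    // Thrust configures and launches the reduction kernel itself;
    // the host simply receives the scalar result.
    thrust::device_vector<float> d_vec(1 << 20, 1.0f);
    float sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f);
    return 0;  // sum == 1048576.0f here
}
```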

The problem is I don't want to come back to a CPU context just to do a reduction, since that involves a memory copy… But thanks to all of you for your answers.

Edit
My bad, I can go back to the CPU context using a thrust::device_ptr without losing time on a copy… I'll edit the solution.
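Something along these lines, assuming d_data was allocated with cudaMalloc (the function name is mine):

```cpp
#include <thrust/reduce.h>
#include <thrust/device_ptr.h>

// Wrapping the raw device pointer lets host code call thrust::reduce on it
// directly; the input array is never copied, only the scalar result returns.
float reduce_raw_pointer(float *d_data, int n)
{
    thrust::device_ptr<float> p(d_data);
    return thrust::reduce(p, p + n, 0.0f);
}
```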

Second Edit
BUT, using the reduction on the host side makes me use a copy to give the result back to the device…

I think you could use a primary kernel that runs only one thread and then dispatches other kernels via dynamic parallelism. That might let you avoid device-to-host-to-device copies. I could be wrong on this though; it's just an idea.
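Roughly like this; it's only a sketch (the kernel names and the simple atomicAdd child reduction are mine, not from this thread), and it requires compute capability 3.5+ plus relocatable device code (nvcc -rdc=true … -lcudadevrt):

```cpp
__global__ void child_reduce(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, in[i]);  // simple, not the fastest, device-side sum
}

__global__ void parent(const float *in, float *out, int n)
{
    *out = 0.0f;  // single parent thread initializes the accumulator
    child_reduce<<<(n + 255) / 256, 256>>>(in, out, n);
    // by dynamic-parallelism semantics the child grid completes before this
    // parent grid does, so out is valid for any kernel launched afterwards;
    // the result never leaves device memory
}
```

Launched as parent<<<1, 1>>>(d_in, d_out, n); from the host.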

I don't think so. E.g. in CUB (cub::DeviceSegmentedReduce), all pointers/iterators should be device-accessible, e.g. allocated with cudaMalloc.

Yes, but thrust::reduce only returns the result. I can't point this result to device memory directly from the host.

batched reduce calculates multiple results, stored starting at the OutputIteratorT d_out:

http://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#ad9b73f245930740c4d8786fc1a812364

d_out usually points to an array in device memory (e.g. allocated with cudaMalloc)
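The usual two-phase CUB call pattern looks like this; a sketch with illustrative names, where every pointer lives in device memory:

```cpp
#include <cub/cub.cuh>

// Computes one sum per segment; d_in, d_out and d_offsets are all device
// pointers (num_segments + 1 offsets delimit the segments within d_in).
void segmented_sum(const float *d_in, float *d_out,
                   int num_segments, const int *d_offsets)
{
    void  *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;

    // First call with d_temp_storage == NULL only sizes the temp storage.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call runs the reduction; one sum per segment is written to
    // d_out, which stays in device memory.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaFree(d_temp_storage);
}
```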

thrust::reduce returns its result to the host

thrust::reduce_by_key can be used to return the result to the device

likewise, cub has various options (which BulatZiganshin is pointing out)
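For the thrust::reduce_by_key route, one common trick (a sketch, not from this thread) is to give every element the same key via a constant_iterator, so the single sum is written through a device iterator instead of being returned to the host:

```cpp
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>

int main()
{
    thrust::device_vector<float> d_in(1024, 1.0f);
    thrust::device_vector<float> d_sum(1);  // result stays in device memory
    thrust::device_vector<int>   d_key(1);  // unique-key output, unused

    // All elements share key 0, so exactly one reduction is produced
    // and written to d_sum on the device.
    thrust::reduce_by_key(thrust::constant_iterator<int>(0),
                          thrust::constant_iterator<int>(0) + d_in.size(),
                          d_in.begin(),
                          d_key.begin(),
                          d_sum.begin());
    // d_sum[0] now holds the total; no device->host->device round trip
    return 0;
}
```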