Inconsistent results for reduction, except while printf or cudamemcheck

Zenfhou · September 11, 2016, 8:13am

Really? As a beginner, it's really hard to know whether our code is optimized or not, and the information isn't kept in one place… What are these requirements and compilation settings? Because if it's sequential, I could have just written a sequential for loop and summed?…

Thanks.

Robert_Crovella · September 11, 2016, 12:14pm

You may wish to re-read this question that I already linked:

How to use Thrust to sort the rows of a matrix? (Stack Overflow)

BulatZiganshin · September 11, 2016, 12:48pm

batched reduction:
http://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html
https://nvlabs.github.io/moderngpu/segreduce.html

MutantJohn · September 11, 2016, 3:32pm

Yeah, I was surprised to see you calling your reduction from inside a kernel. I assumed you wanted to call thrust::reduce from the CPU context where it’ll launch a properly configured kernel to give you the reduction you desire.

Zenfhou · September 12, 2016, 9:35am

The problem is that I don't want to go back to the CPU context just to do a reduction, since that involves a memory copy… But thanks to all of you for your answers.

Edit
My bad, I can go back to the CPU context using a thrust::device_ptr without losing time on a copy… I'll edit the solution.

Second Edit
BUT running the reduction on the host side forces me to do a copy to hand the result back to the device…
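
For readers following along, the two edits above can be sketched roughly as below. This is only an illustration, not the poster's actual code: the buffer names `d_data` and `d_sum`, the size `n`, and the fill value are assumptions. It wraps a raw device pointer in a thrust::device_ptr so thrust::reduce runs on the GPU without copying the input, and then shows the extra host-to-device copy needed because the sum comes back as a host value.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;            // hypothetical element count
    float* d_data = nullptr;          // hypothetical raw device buffer (e.g. written by a kernel)
    float* d_sum  = nullptr;          // hypothetical device slot that should receive the sum
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));

    // Wrapping the raw pointer tells Thrust the data lives on the device,
    // so the input is reduced in place and never copied to the host.
    thrust::device_ptr<float> first = thrust::device_pointer_cast(d_data);
    thrust::fill(first, first + n, 1.0f);   // placeholder data for the sketch

    // thrust::reduce is called from host code and returns the sum as a host value.
    float sum = thrust::reduce(first, first + n, 0.0f);

    // Getting that value back into device memory costs an explicit copy.
    cudaMemcpy(d_sum, &sum, sizeof(float), cudaMemcpyHostToDevice);

    printf("sum = %f\n", sum);
    cudaFree(d_data);
    cudaFree(d_sum);
    return 0;
}
```

Only a single float travels in each direction here, but the round trip still synchronizes the host with the device.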

MutantJohn · September 12, 2016, 5:55pm

I think you can use a primary kernel that only runs one thread and then dispatches various other kernels using dynamic parallelism. I think you can use that to avoid device-to-host-to-device copies. I could be wrong on this though, this is just an idea.

BulatZiganshin · September 12, 2016, 6:27pm

I don't think so. For example, in CUB's cub::DeviceSegmentedReduce, all pointers/iterators should be device-accessible, e.g. allocated with cudaMalloc.

Zenfhou · September 13, 2016, 10:12am

Yes, but reduce only returns the result. I can't direct that result into device memory straight from the host.

BulatZiganshin · September 13, 2016, 10:21am

Batched reduce calculates multiple results, stored starting at the OutputIteratorT d_out:

http://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#ad9b73f245930740c4d8786fc1a812364

d_out usually points to an array in device memory (e.g. allocated with cudaMalloc)
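
A minimal sketch of that batched (segmented) reduction, assuming a small made-up input of 7 integers split into 3 segments. The array names and values are illustrative only. It uses cub::DeviceSegmentedReduce::Sum with its usual two-call pattern (first call sizes the temporary storage, second call runs the reduction), and the per-segment sums land in `d_out` in device memory.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Illustrative data: segment i covers [h_offsets[i], h_offsets[i + 1]).
    const int num_segments = 3;
    const int h_in[7]      = {1, 2, 3, 4, 5, 6, 7};
    const int h_offsets[4] = {0, 2, 5, 7};

    int *d_in = nullptr, *d_offsets = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_offsets, sizeof(h_offsets));
    cudaMalloc(&d_out, num_segments * sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

    // First call with a null workspace only reports the temporary storage size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    num_segments, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the reduction; one sum per segment is written to d_out.
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    num_segments, d_offsets, d_offsets + 1);

    // Copied back here only to inspect the sums (1+2, 3+4+5, 6+7);
    // a later kernel could read d_out directly instead.
    int h_out[3];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < num_segments; ++i) printf("segment %d: %d\n", i, h_out[i]);

    cudaFree(d_temp);
    cudaFree(d_out);
    cudaFree(d_offsets);
    cudaFree(d_in);
    return 0;
}
```

The calls are still issued from host code, but the results never have to leave the GPU.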

Robert_Crovella · September 13, 2016, 2:37pm

thrust::reduce returns its result to the host

thrust::reduce_by_key can be used to return the result to the device

likewise, cub has various options (which BulatZiganshin is pointing out)
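
One way to read the reduce_by_key suggestion, as a sketch under assumptions rather than a definitive recipe: give every element the same key through a thrust::constant_iterator, so the whole range forms a single segment, and write the one resulting sum into a device vector. The names `d_in` and `d_sum` and the input values are made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <cstdio>

int main() {
    const int n = 1000;                       // hypothetical element count
    thrust::device_vector<int> d_in(n, 1);    // placeholder input living on the device
    thrust::device_vector<int> d_sum(1);      // device-side slot for the result

    // All keys are equal, so reduce_by_key sees one segment spanning the whole input.
    thrust::constant_iterator<int> keys(0);

    // The output key is discarded; the single sum is written to d_sum on the device.
    thrust::reduce_by_key(keys, keys + n,
                          d_in.begin(),
                          thrust::make_discard_iterator(),
                          d_sum.begin());

    // Read back only to show the value; a kernel could instead use
    // thrust::raw_pointer_cast(d_sum.data()) without any host round trip.
    int h_sum = d_sum[0];
    printf("sum = %d\n", h_sum);
    return 0;
}
```

With real keys (for example a row index per element), the same call produces one sum per run of equal keys, which is the batched case discussed earlier in the thread.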
