Cooperative groups are much slower than CUB

Robert_Crovella · November 7, 2024, 4:33pm

I suspect the discrepancy is related to the block size. If I change the block size to 32, then the cg method becomes ~4x faster rather than ~4x slower. (In that case, CUB might catch up if we switched from block reduce to warp reduce.)
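For reference, here is a rough sketch of what a warp-level CUB reduction could look like with a 32-thread block. This is not the code from the original question: the kernel name `warp_sum_cub`, the host helper `run_warp_sum_cub`, and the use of `atomicAdd` to combine per-warp results are assumptions made for illustration, and I have not benchmarked it.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Assumption: one warp per block, matching the block-size-32 experiment.
constexpr int kBlockThreads = 32;

// Hypothetical kernel: each warp sums its 32 inputs with cub::WarpReduce,
// then lane 0 adds the warp's partial sum to a single global accumulator.
__global__ void warp_sum_cub(const float* in, float* out, int n) {
    using WarpReduce = cub::WarpReduce<float>;
    __shared__ typename WarpReduce::TempStorage temp;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;   // out-of-range threads contribute 0

    float s = WarpReduce(temp).Sum(v);  // result is only valid in lane 0
    if (threadIdx.x == 0) atomicAdd(out, s);
}

// Hypothetical host helper: sums n ones on the device and returns the total.
// Error checking is omitted to keep the sketch short.
float run_warp_sum_cub(int n) {
    std::vector<float> h(n, 1.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    int blocks = (n + kBlockThreads - 1) / kBlockThreads;
    warp_sum_cub<<<blocks, kBlockThreads>>>(d_in, d_out, n);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Whether this actually closes the gap would need to be measured; the point is only that `cub::WarpReduce` is the like-for-like comparison against a 32-thread cg tile, whereas `cub::BlockReduce` also has to combine results across warps.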

AFAIK, the largest “hardware-accelerated” tile for cg is a tile size of 32 (templated, known at compile time). While you can obviously use a larger tile size, I don’t know what decisions/implementation cg is using in that case, and it may not be “best”. Most of the examples I see in the programming guide pick a tile size of 32.
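To make the tile-size-32 case concrete, here is a minimal sketch of a cg reduction over a compile-time tile of 32 threads. The kernel name `tile_sum_cg`, the host helper `run_tile_sum_cg`, the block size of 256, and the `atomicAdd` used to combine tile results are my own choices for illustration, not taken from the original question.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel: the block is partitioned into 32-thread tiles
// (size known at compile time), each tile reduces its values with
// cg::reduce, and the tile's rank-0 thread accumulates the partial sum.
__global__ void tile_sum_cg(const float* in, float* out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;   // out-of-range threads contribute 0

    float s = cg::reduce(tile, v, cg::plus<float>());
    if (tile.thread_rank() == 0) atomicAdd(out, s);
}

// Hypothetical host helper: sums n ones on the device and returns the total.
// Error checking is omitted to keep the sketch short.
float run_tile_sum_cg(int n) {
    std::vector<float> h(n, 1.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256;            // assumed; a multiple of the tile size
    int blocks = (n + threads - 1) / threads;
    tile_sum_cg<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Keeping the tile at 32 and combining tile results yourself (here with an atomic, but shared memory would also work) avoids depending on whatever cg does internally for larger tiles.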

I agree that it would be nice if CUB and cg were comparable in performance, generally. You’re welcome to file a bug.
