Cooperative groups are much slower than CUB

I suspect the discrepancy is related to the block size. If I change the block size to 32, the cg method becomes ~4x faster rather than ~4x slower. (In that case, CUB might catch up if we switched from block reduce to warp reduce.)
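For reference, here is a minimal sketch of what the CUB warp-reduce variant might look like. It is not the code from this thread: the kernel name `warp_reduce_sum`, the block size of 128, and the all-ones input are assumptions picked for illustration, and I haven't benchmarked it against the cg version.

```cuda
#include <cstdio>
#include <vector>
#include <cub/cub.cuh>

constexpr int kBlockThreads = 128;                  // assumed block size (illustrative)
constexpr int kWarpsPerBlock = kBlockThreads / 32;  // one TempStorage per warp

// Hypothetical kernel: each warp reduces its 32 values with cub::WarpReduce,
// then lane 0 adds the warp's partial sum to the global result.
__global__ void warp_reduce_sum(const int* in, int* out) {
    using WarpReduce = cub::WarpReduce<int>;
    __shared__ typename WarpReduce::TempStorage temp[kWarpsPerBlock];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;

    // The returned aggregate is only valid in lane 0 of each warp.
    int partial = WarpReduce(temp[warp_id]).Sum(in[idx]);

    if (threadIdx.x % 32 == 0) atomicAdd(out, partial);
}

int main() {
    const int n = 1024;  // multiple of kBlockThreads, so no bounds check is needed
    std::vector<int> h_in(n, 1);

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    warp_reduce_sum<<<n / kBlockThreads, kBlockThreads>>>(d_in, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d (expected %d)\n", h_out, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out == n ? 0 : 1;
}
```

The per-warp partial sums are combined with `atomicAdd` only to keep the sketch short; a real comparison should combine them the same way in both the CUB and cg versions.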

AFAIK, the largest “hardware-accelerated” tile size for cg is 32 (a template parameter, known at compile time). You can obviously use a larger tile size, but I don’t know what implementation decisions cg makes in that case, and the result may not be the “best” one. Most of the examples I see in the programming guide pick a tile size of 32.
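To make the tile-size-32 case concrete, here is a minimal sketch of a reduction over a compile-time tile of 32 threads using `cg::tiled_partition<32>` and `cg::reduce` (CUDA 11.0 or later). The kernel name `tile_reduce_sum`, the block size of 128, and the all-ones input are assumptions for illustration, not the code being discussed.

```cuda
#include <cstdio>
#include <vector>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

constexpr int kBlockThreads = 128;  // assumed block size (illustrative)

// Hypothetical kernel: partition the block into 32-thread tiles (size fixed at
// compile time), reduce within each tile, and let one thread per tile add the
// tile's sum to the global result.
__global__ void tile_reduce_sum(const int* in, int* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // cg::reduce returns the reduced value to every thread in the tile.
    int tile_sum = cg::reduce(tile, in[idx], cg::plus<int>());

    if (tile.thread_rank() == 0) atomicAdd(out, tile_sum);
}

int main() {
    const int n = 1024;  // multiple of kBlockThreads, so no bounds check is needed
    std::vector<int> h_in(n, 1);

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    tile_reduce_sum<<<n / kBlockThreads, kBlockThreads>>>(d_in, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d (expected %d)\n", h_out, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out == n ? 0 : 1;
}
```

With the tile fixed at 32, each tile corresponds to one warp, which is the case I'd expect cg to handle well.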

I agree that it would be nice if CUB and cg were comparable in performance, generally. You’re welcome to file a bug.