@Robert_Crovella Thank you for the quick response. I have found many of your answers on this forum very insightful.
That is an excellent point. I updated both programs so that the reduction results are actually consumed; the code is given below. The difference is similar, with the cooperative groups version taking about 2.3 times as long (50.865 s / 22.440 s ≈ 2.27), except the timings now also include the modulus overhead.
| Method | Runtime |
|---|---|
| Cooperative groups | 50.865 s |
| CUB | 22.440 s |
I also observed a similar slowdown when switching a larger program from using CUB to cooperative groups.
The guide states “reduce […] takes advantage of hardware acceleration […] for the arithmetic add, min, or max operations and the logical AND, OR, or XOR”.
My interpretation of this quote is that cooperative groups use a hardware-accelerated reduction that, in theory, cannot be beaten. At best, I would expect CUB to match it, yet CUB is much faster.
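As far as I can tell, the hardware path the guide describes is the redux.sync instruction, which operates at warp scope on compute capability 8.0 and higher, so a reduction over a larger tile has to be composed from warp-level steps. Here is a minimal warp-scope sketch (my own illustration, not one of the benchmarks below; the kernel name and single-warp launch are just for demonstration) where that single-instruction path should apply:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Sum the values 0..31 across a single warp. For 32-bit integer plus on
// compute capability 8.0+, cg::reduce at warp scope can map to the
// hardware redux.sync instruction.
__global__ void warp_sum_kernel(unsigned *out) {
  const auto warp{cg::tiled_partition<32>(cg::this_thread_block())};
  const unsigned thread_value{warp.thread_rank()};
  const auto sum{cg::reduce(warp, thread_value, cg::plus<unsigned>())};
  if (warp.thread_rank() == 0) {
    *out = sum;
  }
}

int main() {
  unsigned *out;
  cudaMallocManaged(&out, sizeof *out);
  warp_sum_kernel<<<1, 32>>>(out);
  cudaDeviceSynchronize();
  printf("sum: %u\n", *out);  // 0 + 1 + ... + 31 = 496
  cudaFree(out);
  return 0;
}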
In any case, it is surprising that there is any discrepancy between the two methods at all. Both have access to the same hardware features, and both know the block size at compile time. The only difference I know of is that CUB communicates through shared memory.
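To illustrate what I mean about shared memory: BlockReduce communicates through a TempStorage allocation that can be declared explicitly, and the default-constructed form in my benchmark below just allocates the same storage implicitly. A minimal sketch with the storage spelled out (my own illustration, assuming the same 1024-thread block; the kernel name is hypothetical):

#include <cub/block/block_reduce.cuh>
#include <cstdio>

constexpr int BLOCK_SIZE{1024};

__global__ void block_sum_kernel(unsigned *out) {
  using BlockReduce = cub::BlockReduce<unsigned, BLOCK_SIZE>;
  // Inter-warp communication goes through this shared-memory allocation.
  __shared__ typename BlockReduce::TempStorage temp_storage;
  const unsigned sum{BlockReduce{temp_storage}.Sum(threadIdx.x)};
  if (threadIdx.x == 0) {
    *out = sum;  // only thread 0 holds the valid aggregate
  }
}

int main() {
  unsigned *out;
  cudaMallocManaged(&out, sizeof *out);
  block_sum_kernel<<<1, BLOCK_SIZE>>>(out);
  cudaDeviceSynchronize();
  printf("sum: %u\n", *out);  // 0 + 1 + ... + 1023 = 523776
  cudaFree(out);
  return 0;
}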
Thank you for sharing this useful tool. To be clear, I am asking from a user's perspective:
- Why am I getting poorer performance with cooperative groups? Specifically, am I using it incorrectly?
- Given that CUB is faster, should I prefer it where possible?
I agree. For now, I am focused only on block reduce, which both libraries support.
Thank you for mentioning the improvement. The updated test cases are below.
Block reduce with cooperative groups
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cstdio>

namespace cg = cooperative_groups;

constexpr auto BLOCK_SIZE{1024};
constexpr auto ITERATIONS{64 * 1024 * 1024};

__global__ void reduce_kernel(unsigned &result) {
  const auto block{cg::this_thread_block()};
  const auto tile{cg::tiled_partition<BLOCK_SIZE>(block)};
  const auto lane{tile.thread_rank()};
  auto value{0u};
  for (auto i{0}; i < ITERATIONS; ++i) {
    const auto thread_value{i % lane};
    const auto increment{cg::reduce(tile, thread_value, cg::plus<unsigned>())};
    // Feed the reduction result back through a modulus so every iteration
    // depends on the previous one and the loop cannot be optimized away.
    // (Remainder by zero, which occurs for lane 0 and for the initial value
    // of 0, does not trap on the GPU; it just yields an unspecified value.)
    value = increment % value;
  }
  cg::invoke_one(tile, [&] { result = value; });
}

int main() {
  unsigned *result;
  cudaMallocManaged(&result, sizeof *result);
  reduce_kernel<<<1, BLOCK_SIZE>>>(*result);
  cudaDeviceSynchronize();
  // Prints "result: 139".
  printf("result: %u\n", *result);
  cudaFree(result);
  return 0;
}
Block reduce with CUB
#include <cub/block/block_reduce.cuh>
#include <cstdio>

constexpr auto BLOCK_SIZE{1024};
constexpr auto ITERATIONS{64 * 1024 * 1024};

__global__ void reduce_kernel(unsigned &result) {
  const auto lane{threadIdx.x};
  auto value{0u};
  for (auto i{0}; i < ITERATIONS; ++i) {
    const auto thread_value{i % lane};
    // The default-constructed BlockReduce communicates through implicitly
    // allocated shared memory; only thread 0 receives the valid sum.
    const auto increment{
        cub::BlockReduce<unsigned, BLOCK_SIZE>{}.Sum(thread_value)};
    // CUB requires a barrier before its temporary storage is reused,
    // which happens here on the next loop iteration.
    __syncthreads();
    // Same loop-carried modulus as in the cooperative groups version.
    value = increment % value;
  }
  if (threadIdx.x == 0) {
    result = value;
  }
}

int main() {
  unsigned *result;
  cudaMallocManaged(&result, sizeof *result);
  reduce_kernel<<<1, BLOCK_SIZE>>>(*result);
  cudaDeviceSynchronize();
  // Prints "result: 139".
  printf("result: %u\n", *result);
  cudaFree(result);
  return 0;
}