Would like to share my speedy reduction code - very simple!

Hey Guys!

The other day I thought to myself that reduction need not be too complicated, so I sat down and wrote some simple code.

I quite soon got better results than even the bandwidth test: I was getting closer to peak than Mark Harris's implementation, and slightly better than the implementation posted by cuda2010 a few months ago.



My results:

GPU-sum: 16777216.000

CPU-sum: 16777216.000

GPU-time: 113649 usec

Throughput: 59.049 GB/s


cuda2010's code:

Use device 0: "GeForce GTS 250"

Measured bandwidth=58.0GB/s

Reduce size=122880000, use time=8.517000ms, GPU_result=552985938, CPU_result=552985938


I used cuda2010's suggested settings and ran his code; note that we don't have the same reduce size (his is larger).

About the code: it exploits the fact that my problem size is a multiple of the block size, which makes this an unfair comparison against Mark Harris's general code.

I do however believe that one could run these unrolled kernels over the largest prefix of the input that is a multiple of the elements-per-block count, and finish off the last few elements with a kernel that takes a dynamic element count.

This should take a slight hit on performance while offering a general solver. What do you guys think?

Here is the code; it is very simple:

compiled with: nvcc my_reduction.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math 

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <math.h>
#include <unistd.h>
#include <sys/time.h>

unsigned long long int get_clock()
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return (unsigned long long int)tv.tv_usec + 1000000*tv.tv_sec;
}

__device__ void warp_reduce(volatile float smem[64])
{
	// A single warp of 32 threads executes in lockstep, so no
	// __syncthreads() is needed between steps; smem is volatile so the
	// compiler doesn't cache the partial sums in registers.
	smem[threadIdx.x] += smem[threadIdx.x + 16];
	smem[threadIdx.x] += smem[threadIdx.x + 8];
	smem[threadIdx.x] += smem[threadIdx.x + 4];
	smem[threadIdx.x] += smem[threadIdx.x + 2];
	smem[threadIdx.x] += smem[threadIdx.x + 1];
}

template<int els_per_block, int threads>
__global__ void reduce(float* in, float* out)
{
	__shared__ float smem[64];

	int tid = threadIdx.x + __mul24(blockIdx.x, els_per_block);
	float sum = 0.0f;

	#pragma unroll
	for(int i = 0; i < els_per_block/threads; i++)
	{
		sum += in[tid + __mul24(i,threads)];
	}

	smem[threadIdx.x + 32] = 0.0f; // zero the padding half so the +16 step reads zeros
	smem[threadIdx.x] = sum;

	warp_reduce(smem);

	if(threadIdx.x == 0)
		out[blockIdx.x] = smem[threadIdx.x]; // out[0] == ans
}


float cpu_sum(float* in, int num_els, int nreps)
{
	float sum = 0.0f;

	for(int i = 0; i < num_els; i++)
	{
		sum += in[i];
	}

	return sum;
}


unsigned long long int my_reduction_test()
{
	// timers
	unsigned long long int start;
	unsigned long long int delta;

	const int threads = 32;
	const int num_els = 2048*2048*4;
	const int begin_els_per_block = num_els/128;
	int size = num_els*sizeof(float);

	float* d_in;
	float* d_out;
	float* d_warm1;
	float* d_warm2;

	float* in = (float*)malloc(size);
	float* out = (float*)malloc(size);

	for(int i = 0; i < num_els; i++)
	{
		in[i] = 1.0f; //float(rand())/float(RAND_MAX);
	}

	cudaMalloc((void**)&d_in, size);
	cudaMalloc((void**)&d_out, size);
	cudaMalloc((void**)&d_warm1, 1024*sizeof(float));
	cudaMalloc((void**)&d_warm2, 1024*sizeof(float));

	cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

	printf("\n size: %0.3f MB \n", 2.0f*float(size)/(1024.0f*1024.0f));

	// warmup
	reduce<32,threads><<< 32, 32 >>>(d_warm1, d_warm2);
	cudaThreadSynchronize();
	// end warmup

	// time it
	start = get_clock();

	// real reduce
	for(int i = 0; i < 100; i++)
		reduce<begin_els_per_block,threads><<< num_els/begin_els_per_block, 32 >>>(d_in, d_out);

	for(int i = 0; i < 100; i++)
		reduce<num_els/begin_els_per_block, threads ><<< 1, 32 >>>(d_out, d_in); // sum up the last few values

	cudaThreadSynchronize(); // kernel launches are asynchronous, so wait before stopping the clock

	delta = get_clock() - start;
	float dt = float(delta)/100.0f;

	cudaMemcpy(out, d_in, size, cudaMemcpyDeviceToHost);

	printf("\n GPU-sum: %0.3f \n", out[0]);
	printf("\n CPU-sum: %0.3f \n", cpu_sum(in, num_els, 100));
	printf("\n GPU-time: %llu usec \n", delta);

	float throughput = num_els*sizeof(float)*0.000000001f/(dt*0.000001);
	printf("\n Throughput: %0.3f GB/s \n", throughput);

	printf("\n %s \n", cudaGetErrorString(cudaGetLastError()));

	cudaFree(d_in); cudaFree(d_out);
	cudaFree(d_warm1); cudaFree(d_warm2);
	free(in); free(out);

	return delta;
}


int main(int argc, char* argv[])
{
	my_reduction_test();

	return 0;
}

I would be happy to hear your results on bigger hardware for bigger problem sizes. It could be that my code isn't very effective on "bigger iron", and I imagine some parameter tweaking would be needed for optimal results. I'm sorry to say I only have a GTS 250 to run from home :/

OK, I hope this might be of some use!