Hey Guys!
The other day I thought to myself that reduction need not be too complicated, so I sat down and wrote some simple code.
I quite soon got better results than even the bandwidth test: I was getting closer to peak than Mark Harris's implementation, and slightly better than the implementation posted by cuda2010 a few months ago.
Printouts:
my_reduction
GPU-sum: 16777216.000
CPU-sum: 16777216.000
GPU-time: 113649 usec
Throughput: 59.049 GB/s
--------------------------------
cuda2010's code:
Use device 0: "GeForce GTS 250"
Measured bandwidth=58.0GB/s
Reduce size=122880000, use time=8.517000ms, GPU_result=552985938, CPU_result=552985938
Test PASSED
I used cuda2010's suggested settings and ran it; note that we don't have the same reduce size (his is larger).
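As a side note, the Throughput line is consistent with the rest of the printout: num_els = 2048*2048*4 four-byte reads, and the 113649 usec GPU-time divided by the 100 timing iterations. Redoing the arithmetic from the code below in plain C:

```c
/* Redo the throughput arithmetic from the posted code: one 4-byte read per
   element, divided by the average time per timed iteration. */
double throughput_gbps(long long num_els, long long total_usec, int reps)
{
    double bytes = (double)num_els * sizeof(float);     /* bytes read once   */
    double dt_usec = (double)total_usec / (double)reps; /* avg per iteration */
    return bytes * 1e-9 / (dt_usec * 1e-6);
}
```

Plugging in 2048*2048*4 elements, 113649 usec and 100 reps gives 59.049 GB/s, matching the GPU-sum run above.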
About the code: it exploits the fact that my problem size is a multiple of the block size, which makes this an unfair comparison against Mark Harris's general code.
I do, however, believe that one could run these unrolled kernels up to the nearest multiple of the number of elements and finish off the last few elements with a kernel that takes the count as a run-time variable.
This should take a slight hit on performance while offering a general solver. What do you guys think?
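To make that concrete, here is a rough host-side sketch in plain C of the control flow I have in mind (just an illustration of the split, not device code): the largest prefix that is a multiple of els_per_block goes through the fixed, fully unrolled path, and a dynamic tail finishes the remainder.

```c
/* Host-side sketch of the proposed general reduction: a fixed-size path for
   the largest prefix that is a multiple of els_per_block (where the unrolled
   kernels would run), plus a dynamic tail for the leftover elements. */
float reduce_general(const float* in, int n, int els_per_block)
{
    int bulk = (n / els_per_block) * els_per_block; /* nearest lower multiple */
    float sum = 0.0f;

    for (int i = 0; i < bulk; i++)   /* fixed part: compile-time trip count */
        sum += in[i];

    for (int i = bulk; i < n; i++)   /* dynamic tail: run-time count */
        sum += in[i];

    return sum;
}
```

On the GPU the two loops would be the templated kernel and a small non-templated one; the tail is at most els_per_block - 1 elements, so its cost should be negligible for large n.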
Here is the code; it is very simple:
compiled with: nvcc my_reduction.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <math.h>
#include <unistd.h>
#include <sys/time.h>
// wall-clock time in microseconds
unsigned long long int get_clock()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    // 1000000ULL keeps the multiply in 64 bits on 32-bit hosts
    return (unsigned long long int)tv.tv_usec + 1000000ULL * tv.tv_sec;
}
// Sums the 32 partial results in smem[0..31] down into smem[0].
// smem is declared volatile so the compiler cannot cache the shared-memory
// values in registers between the unsynchronized steps.
__device__ void warp_reduce(volatile float smem[64])
{
#ifdef __DEVICE_EMULATION__
    smem[threadIdx.x] += smem[threadIdx.x + 16]; __syncthreads();
    smem[threadIdx.x] += smem[threadIdx.x + 8];  __syncthreads();
    smem[threadIdx.x] += smem[threadIdx.x + 4];  __syncthreads();
    smem[threadIdx.x] += smem[threadIdx.x + 2];  __syncthreads();
    smem[threadIdx.x] += smem[threadIdx.x + 1];  __syncthreads();
#else
    // a single warp executes in lockstep, so no __syncthreads() is needed
    smem[threadIdx.x] += smem[threadIdx.x + 16];
    smem[threadIdx.x] += smem[threadIdx.x + 8];
    smem[threadIdx.x] += smem[threadIdx.x + 4];
    smem[threadIdx.x] += smem[threadIdx.x + 2];
    smem[threadIdx.x] += smem[threadIdx.x + 1];
#endif
}
template<int els_per_block, int threads>
__global__ void reduce(float* in, float* out)
{
    __shared__ float smem[64];

    int tid = threadIdx.x + __mul24(blockIdx.x, els_per_block);

    // each thread serially sums its els_per_block/threads elements
    float sum = 0.0f;

#pragma unroll
    for(int i = 0; i < els_per_block/threads; i++)
    {
        sum += in[tid + __mul24(i, threads)];
    }

    // zero the upper half so warp_reduce can safely read past the 32 sums
    smem[threadIdx.x + 32] = 0.0f;
    smem[threadIdx.x] = sum;

    warp_reduce(smem);

    if(threadIdx.x == 0)
        out[blockIdx.x] = smem[0]; // one partial sum per block; out[0] == ans
}
float cpu_sum(float* in, int num_els, int nreps) // nreps unused
{
    float sum = 0.0f;
    for(int i = 0; i < num_els; i++)
    {
        sum += in[i];
    }
    return sum;
}
unsigned long long int my_reduction_test()
{
    // timers
    unsigned long long int start;
    unsigned long long int delta;

    const int threads = 32; // one warp per block
    const int num_els = 2048*2048*4;
    const int begin_els_per_block = num_els/128; // 128 blocks in the first pass

    int size = num_els*sizeof(float);

    float* d_in;
    float* d_out;
    float* d_warm1;
    float* d_warm2;

    float* in = (float*)malloc(size);
    float* out = (float*)malloc(size);

    for(int i = 0; i < num_els; i++)
    {
        in[i] = 1.0f; //float(rand())/float(RAND_MAX);
    }

    cudaMalloc((void**)&d_in, size);
    cudaMalloc((void**)&d_out, size);
    cudaMalloc((void**)&d_warm1, 1024*sizeof(float));
    cudaMalloc((void**)&d_warm2, 1024*sizeof(float));

    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    printf("\n size: %0.3f MB \n", 2.0f*float(size)/(1024.0f*1024.0f)); // in + out buffers

    //////////
    /// warmup
    //////////
    reduce<32, threads><<< 32, 32 >>>(d_warm1, d_warm2);
    cudaThreadSynchronize();
    /////
    // end warmup
    /////

    // time it
    start = get_clock();

    //////////////
    // real reduce
    /////////////
    // first pass: one partial sum per block
    for(int i = 0; i < 100; i++)
        reduce<begin_els_per_block, threads><<< num_els/begin_els_per_block, 32 >>>(d_in, d_out);

    // second pass: sum up the per-block results; the answer lands in d_in[0]
    for(int i = 0; i < 100; i++)
        reduce<num_els/begin_els_per_block, threads><<< 1, 32 >>>(d_out, d_in);

    cudaThreadSynchronize();

    delta = get_clock() - start;
    float dt = float(delta)/100.0f; // average time per iteration

    cudaMemcpy(out, d_in, size, cudaMemcpyDeviceToHost);

    printf("\n GPU-sum: %0.3f \n", out[0]);
    printf("\n CPU-sum: %0.3f \n", cpu_sum(in, num_els, 100));
    printf("\n GPU-time: %llu usec \n", delta);

    float throughput = num_els*sizeof(float)*0.000000001f/(dt*0.000001);
    printf("\n Throughput: %0.3f GB/s \n", throughput);

    printf("\n %s \n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_warm1);
    cudaFree(d_warm2);

    free(in);
    free(out);

    return delta;
}
int main(int argc, char* argv[])
{
    my_reduction_test();
    //do_other_test();
    return 0;
}
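In case the smem[threadIdx.x + 32] = 0.0f line in the kernel looks odd: the 64-entry shared array is zero-padded so every step of warp_reduce can read 16, 8, ... entries past the active lanes without a bounds check. Here is a plain-C model of the same access pattern (a sequential stand-in for the 32 lockstep lanes, not device code):

```c
/* Plain-C model of warp_reduce: 32 "lanes" over a 64-entry array whose upper
   half is zero, mirroring smem[threadIdx.x + 32] = 0.0f in the kernel.
   Iterating lanes in ascending order matches the lockstep reads, because each
   lane only reads entries above the ones already updated in the same step. */
float warp_reduce_model(float smem[64])
{
    for (int off = 16; off >= 1; off /= 2)      /* the five unrolled steps */
        for (int lane = 0; lane < 32; lane++)
            smem[lane] += smem[lane + off];
    return smem[0];                             /* total of smem[0..31] */
}
```

Only smem[0] (and the entries it pulls from) ends up with a meaningful value, which is why the kernel has thread 0 alone write the block result.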
I would be happy to hear your results on bigger hardware for bigger problem sizes. It could be that my code isn't very effective on "bigger iron", and I imagine one would want some parameter tweaking to achieve optimal results. I'm sorry to say I only have a GTS 250 to run from home.
Ok, I hope this might come to some use!