I am new to CUDA programming. As a beginner exercise, I thought I'd follow http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf and write some reduction algorithms. What my program should do is the following:
It takes a large float array as input; the array simply starts at 0 and increments by 1.
It carries out a partial reduction on the array. The result should be 1/512th the size of the input array, and each element of the result is the sum of 512 consecutive elements of the input (e.g. result[0] = 0 + 1 + ... + 511 = 130816).
The main program does the same summation on the CPU and compares the results.
This is my code so far:
#include <stdio.h>
#include <math.h>
#define THREAD_COUNT 512
//-----------------this is basically the first reduction algorithm given in the reference----------
__global__ void gbest_power_sums1(float* result, float* powers){
    __shared__ float sdata[THREAD_COUNT];
    unsigned int tid = threadIdx.x;
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = powers[i];
    __syncthreads();
    unsigned int s;
    for(s = 1; s < blockDim.x; s *= 2){
        if(tid % (2 * s) == 0){
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    if(tid == 0)
        result[blockIdx.x] = sdata[0];
}
int main(){
    unsigned long input_size = 1 << 18;
    unsigned long output_size = input_size / THREAD_COUNT;
    //------------prepare the input----------------------------------------------------------------
    float* input;
    float* input_dev;
    input = (float*) malloc(sizeof(float) * input_size);
    cudaMalloc(&input_dev, sizeof(float) * input_size);
    long i;
    for(i = 0; i < input_size; i++){
        input[i] = i;
    }
    cudaMemcpy(input_dev, input, sizeof(float) * input_size, cudaMemcpyHostToDevice);
    //-----------------------------------------------------------------------------------------------
    float* result;
    float* result_dev;
    result = (float*) malloc(sizeof(float) * output_size);
    cudaMalloc(&result_dev, sizeof(float) * output_size);
    //-------------------------carry out reduction----------------------------------------------------
    gbest_power_sums1<<<output_size, THREAD_COUNT>>>(result_dev, input_dev);
    cudaMemcpy(result, result_dev, sizeof(float) * output_size, cudaMemcpyDeviceToHost);
    cudaError_t error = cudaPeekAtLastError();
    if(error != cudaSuccess){
        printf("%s\n", cudaGetErrorString(error));
        exit(1);
    }
    //--------------------calculate the result in CPU for verification-------------------------------
    int j;
    for(i = 0; i < output_size; i++){
        float actual_result = 0.0f;
        for(j = 0; j < THREAD_COUNT; j++){
            actual_result += input[i * THREAD_COUNT + j];
        }
        if(fabs(actual_result - result[i]) > 0.1){
            printf("Error at i=%ld, expected:%f, got:%f\n", i, actual_result, result[i]);
            exit(1);
        }
    }
    printf("test passed!\n");
    cudaFree(input_dev);
    cudaFree(result_dev);
    free(input);
    free(result);
    return 0;
}
The code works, but only up to an input array size of 2^17. When I try 2^18, the results calculated by the GPU differ from those calculated by the CPU. The printed output I get for an array of size 2^18 is:
Error at i=65, expected:17170172.000000, got:17170176.000000
There is a difference of 4.0 here. I don’t understand why. Any help would be appreciated.