I was working on parallelizing a neural network training algorithm, and during testing I noticed that the results of the original CPU version and the GPU version were slightly different: the CPU would learn the same training set, using the same algorithm, in fewer iterations than the GPU.
Maybe my GPU implementation of the training algorithm is faulty somewhere, but in any case I started wondering what caused the difference, so I wrote this little test program that does nothing but a few float calculations and prints the results. (Never mind the excessive use of global memory in this example; its only purpose is to see how the floating-point calculations work out.)
Here is the full source code:
#include <stdio.h>

// Float calculation test function for Device
__global__ void floatTestDevice(float *d_A, float *d_B, float *d_C) {
    int tx = threadIdx.x;
    for (int i = 0; i < 10; i++) {
        d_C[tx] += d_A[tx] * d_B[tx];
    }
}

// Float calculation test function for Host
void floatTestHost(float *h_A, float *h_B, float *h_C, int amount) {
    for (int tx = 0; tx < amount; tx++) {
        for (int i = 0; i < 10; i++) {
            h_C[tx] += h_A[tx] * h_B[tx];
        }
    }
}

int main() {
    // Declare variables A, B and C for both Host and Device
    float *h_A, *h_B, *h_C, *h_dC;  // h_dC will contain d_C in Host memory
    float *d_A, *d_B, *d_C;
    int amount = 10;                // Size of arrays and number of loop iterations

    // Allocate space for Host and Device variables
    h_A  = (float*)malloc(sizeof(float)*amount);
    h_B  = (float*)malloc(sizeof(float)*amount);
    h_C  = (float*)malloc(sizeof(float)*amount);
    h_dC = (float*)malloc(sizeof(float)*amount);
    cudaMalloc((void**) &d_A, sizeof(float)*amount);
    cudaMalloc((void**) &d_B, sizeof(float)*amount);
    cudaMalloc((void**) &d_C, sizeof(float)*amount);

    // Set some float values to calculate (just random)
    for (int i = 0; i < amount; i++) {
        h_A[i] = 5.125436/(i+1);
        h_B[i] = 8.234534/(i+1);
        h_C[i] = 0;
    }

    // Copy initial values from Host to Device
    cudaMemcpy(d_A, h_A, sizeof(float)*amount, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, sizeof(float)*amount, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, sizeof(float)*amount, cudaMemcpyHostToDevice);

    // Run test on Device and copy result from Device to Host
    floatTestDevice<<< 1, amount >>>(d_A, d_B, d_C);
    cudaMemcpy(h_dC, d_C, sizeof(float)*amount, cudaMemcpyDeviceToHost);

    // Run test on Host
    floatTestHost(h_A, h_B, h_C, amount);

    // Print and compare results
    printf("Results: CPU -- GPU\t[difference]\n");
    for (int i = 0; i < amount; i++) {
        printf("%f %s %f ", h_C[i], ((h_C[i] == h_dC[i]) ? "==" : "!="), h_dC[i]);
        if (h_C[i] != h_dC[i]) {
            printf("\tdiff: %f", (h_C[i] - h_dC[i]));
        }
        printf("\n");
    }

    // Clean up
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C); free(h_dC);
    return 0;
}
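(Incidentally, %f only prints six decimal places, so it doesn't show exactly how close the two values really are. As a sketch of an alternative comparison, the print loop at the end of main could instead show nine significant digits and the relative difference, reusing the same h_C, h_dC and amount as above:)

    // Alternative comparison loop (sketch): %.9g prints enough digits
    // to distinguish any two distinct float values
    for (int i = 0; i < amount; i++) {
        float diff = h_C[i] - h_dC[i];
        printf("%.9g vs %.9g\trel. diff: %g\n", h_C[i], h_dC[i], diff / h_C[i]);
    }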
The difference between the CPU and GPU answers is surprisingly big, considering the small number of float multiplications and additions performed:
Results: CPU -- GPU [difference]
422.055756 != 422.055695 diff: 0.000061
105.513939 != 105.513924 diff: 0.000015
46.895081 == 46.895081
26.378485 != 26.378481 diff: 0.000004
16.882231 == 16.882231
11.723770 == 11.723770
8.613382 == 8.613382
6.594621 != 6.594620 diff: 0.000001
5.210565 == 5.210565
4.220558 == 4.220558
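(For reference, the largest difference above, 0.000061 on a value of about 422.06, works out to a relative difference of roughly 0.000061 / 422.06 ≈ 1.4e-7, which is on the order of single-precision machine epsilon, about 1.2e-7. So each result differs only in its last bit or two, and that is already after just ten multiply-add operations per element.)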
So I figured that with my NN algorithm, which performs hundreds of times more float calculations than this example, the difference in the results gets pretty wild.
Is there something wrong with my code, or do we just have to live with the hardware differences?