Hello, I need some help understanding why the following program gives erroneous results. I was working on a lengthy program when a change I made started producing very odd results. I couldn’t find any fault in the logic, so I began testing, removing all extraneous code until it was reduced to the very simple program below. Notice that the kernel only runs one thread, so there is no chance of unexpected parallel interactions. Also notice that there is a lot of weird and unnecessary code in the kernel; those lines were left as they are because changing them causes the code to produce the correct result (I have no idea why). In fact, making any ONE of the following changes will cause this program to give a correct answer:
[list=1]
[*]Uncomment the last __syncthreads()
[*]Remove the first __syncthreads()
[*]Remove either part of the if statement
[*]Replace any threadIdx.x with 0 (which is what it evaluates to anyway)
[*]Swap the order of the two lines of code that assign values to localHasPrimes and localHasPi
[*]Swap the order of the two lines of code that assign values to hasPi and hasPrimes
[*]Remove the last assignment for hasPrimes
[*]Assign a constant value to hasPrimes instead
[*]Assign a constant value to the temp variable
[/list]
Essentially, any modification to the code already written within the kernel will cause it to behave correctly. However, the incorrect result is unaffected when I insert new code, add variables, loop over some of the existing code, move some of the code into functions, increase the size and number of arrays, or change the number of threads and blocks. (For concreteness, one of the working variants is sketched below, after the output.)
#include <stdio.h>
#include <cuda.h>

__global__ void kernel(float *, float *);

int main(void)
{
    float *hasPrimes, *hasPi, hostHasPrimes[1], hostHasPi[1];

    hostHasPrimes[0] = 2357;

    cudaMalloc(&hasPrimes, sizeof(float));
    cudaMalloc(&hasPi, sizeof(float));
    cudaMemcpy(hasPrimes, hostHasPrimes, sizeof(float), cudaMemcpyHostToDevice);

    // Launch a single block containing a single thread.
    kernel <<<1, 1>>> (hasPrimes, hasPi);
    cudaThreadSynchronize();

    cudaMemcpy(hostHasPi, hasPi, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hostHasPrimes, hasPrimes, sizeof(float), cudaMemcpyDeviceToHost);

    printf("Primes: %f\n", hostHasPrimes[0]);
    printf("Pi: %f\n", hostHasPi[0]);

    cudaFree(hasPrimes);
    cudaFree(hasPi);

    printf("\nPress <ENTER> to exit.\n");
    getchar();
    return 0;
}

__global__ void kernel(float *hasPrimes, float *hasPi)
{
    __shared__ float localHasPrimes[1];
    __shared__ float localHasPi[1];
    float tempThatDoesntMatter;

    localHasPrimes[0] = hasPrimes[0];
    localHasPi[threadIdx.x] = 3.141593;

    // threadIdx.x is always 0 here, so this branch is never taken.
    if (threadIdx.x)
        localHasPrimes[threadIdx.x] = 111317;

    __syncthreads();

    tempThatDoesntMatter = hasPrimes[0];
    hasPi[0] = localHasPi[threadIdx.x];
    hasPrimes[0] = tempThatDoesntMatter;
    //__syncthreads();   // uncommenting this produces the correct result
}
The correct result should be:
Primes: 2357.000000
Pi: 3.141593
Press <ENTER> to exit.
However, this is printed instead:
Primes: 2357.000000
Pi: 2357.000000
Press <ENTER> to exit.
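For concreteness, this is what I mean by change 1 in the list above: the same kernel with only the final __syncthreads() uncommented (the host code is untouched). This variant prints 3.141593 for Pi as expected:

__global__ void kernel(float *hasPrimes, float *hasPi)
{
    __shared__ float localHasPrimes[1];
    __shared__ float localHasPi[1];
    float tempThatDoesntMatter;

    localHasPrimes[0] = hasPrimes[0];
    localHasPi[threadIdx.x] = 3.141593;

    if (threadIdx.x)
        localHasPrimes[threadIdx.x] = 111317;

    __syncthreads();

    tempThatDoesntMatter = hasPrimes[0];
    hasPi[0] = localHasPi[threadIdx.x];
    hasPrimes[0] = tempThatDoesntMatter;
    __syncthreads();   // the only change: this barrier is no longer commented out
}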
This code has been run and verified on two different machines. They have different graphics cards, but both tests were done on a Windows 7 x64 machine using Visual Studio 2008 and CUDA driver version 3.2. The error only occurs when the code is compiled for architecture sm_20. I believe that is all the information I have. Any and all help will be greatly appreciated. I really want to know just what on earth is going on with this program. Thanks!
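In case the exact build matters: I compile through the Visual Studio 2008 CUDA build rule, which for this test boils down to roughly the following (the file name is just a placeholder; the important part is the sm_20 architecture flag):

nvcc -arch=sm_20 -o repro repro.cu

Compiling the same file without -arch=sm_20 (e.g. for the default sm_10) does not show the error.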