# Different Results on Two Different Cards

I have written a GPU application that squares the numbers 0 through 9 on the GPU. I ran the CUDA program on two different machines, but I get completely different results. I am using a Tesla C2075 and a GeForce 9400M.

Any help on this issue would be greatly appreciated. Thanks!

Likely a precision issue. Since the 9400M doesn’t support double precision, perhaps you are just comparing its single-precision results with the Tesla’s double-precision results?
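Whether a card has double-precision hardware follows from its compute capability, which the runtime can report. A minimal sketch for listing it, assuming the documented rule that double precision requires compute capability 1.3 or higher; `supports_double` and `print_devices` are hypothetical helper names, not CUDA API functions.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: assumes double precision needs compute capability >= 1.3.
int supports_double(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}

// Hypothetical helper: prints one line per visible device.
// Returns the device count, or -1 if the runtime could not enumerate devices.
int print_devices(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("%d: %s, compute capability %d.%d, double precision: %s\n",
               dev, prop.name, prop.major, prop.minor,
               supports_double(prop.major, prop.minor) ? "yes" : "no");
    }
    return count;
}
```

Calling `print_devices()` from `main` on each machine would show which capability each card reports.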

My results for the Tesla C2075 are:

    0 0.000000
    1 1.000000
    2 2.000000
    3 3.000000
    4 4.000000
    5 5.000000
    6 6.000000
    7 7.000000
    8 8.000000
    9 9.000000

My results for the 9400M are:

    0 0.000000
    1 1.000000
    2 4.000000
    3 9.000000
    4 16.000000
    5 25.000000
    6 36.000000
    7 49.000000
    8 64.000000
    9 81.000000

What is the code leading to these results?

Hrm, not a precision issue then. Posting your code would help.

    #include <stdio.h>
    #include <cuda.h>

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;                  // Pointers to host & device arrays
        const int N = 10;                  // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);       // Allocate array on host
        cudaMalloc((void **) &a_d, size);  // Allocate array on device

        // Initialize host array and copy it to CUDA device
        for (int i = 0; i < N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        // Do calculation on device:
        int block_size = 4;
        int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
        square_array <<< n_blocks, block_size >>> (a_d, N);

        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

        // Print results
        for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

        // Cleanup
        free(a_h); cudaFree(a_d);
    }

Thank you!

So the code does not execute at all on the C2075. To notice when this happens, always check return codes of CUDA calls.
How do you compile your code?
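As a rough illustration of the advice about return codes, here is a sketch of the squaring code with every CUDA call checked. The `check` helper and the `square_on_device` wrapper are hypothetical names introduced for this example, and `cudaDeviceSynchronize` assumes CUDA 4.0 or newer (older toolkits used `cudaThreadSynchronize`).

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): report a failed call and pass the status on.
static cudaError_t check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err;
}

// Same kernel as in the posted program.
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical wrapper: squares a_h[0..N-1] in place on the device and
// returns the first error encountered, or cudaSuccess.
cudaError_t square_on_device(float *a_h, int N)
{
    float *a_d = NULL;
    size_t size = N * sizeof(float);

    cudaError_t err = check(cudaMalloc((void **)&a_d, size), "cudaMalloc");
    if (err != cudaSuccess) return err;

    err = check(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice), "copy to device");
    if (err == cudaSuccess) {
        int block_size = 4;
        int n_blocks = (N + block_size - 1) / block_size;  // round up
        square_array<<<n_blocks, block_size>>>(a_d, N);
        // The launch statement returns nothing; launch failures show up here.
        err = check(cudaGetLastError(), "kernel launch");
    }
    if (err == cudaSuccess)  // errors raised while the kernel was running
        err = check(cudaDeviceSynchronize(), "kernel execution");
    if (err == cudaSuccess)
        err = check(cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost), "copy to host");

    cudaFree(a_d);
    return err;
}
```

In `main`, something like `if (square_on_device(a_h, N) != cudaSuccess) return 1;` before the print loop would make a failed run visible instead of silently printing the unchanged input.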

The squaring application does not seem to execute at all. I compile my code with `nvcc application.cu -o application`. I also ran a simple hello-world application on the C2075 and got Hello Hello when it should, of course, be Hello World.

The code for it is:

```
#include <stdio.h>
#include <stdlib.h>  // EXIT_SUCCESS

const int N = 16;
const int blocksize = 16;

// Each thread adds one offset from b to one character of a
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;  // device copy of a
    int *bd;   // device copy of b
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
```
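For reference, the kernel only adds `b[i]` to `a[i]` for each thread index. A host-only sketch of the same arithmetic (the `hello_reference` name is hypothetical) shows what the device is expected to produce:

```cuda
// Hypothetical host-side reference of the kernel: one loop iteration per thread.
//   'H' + 15 = 'W',  'e' + 10 = 'o',  'l' + 6 = 'r',
//   'l' + 0  = 'l',  'o' - 11 = 'd',  ' ' + 1 = '!'
// The remaining entries of a and b are zero, so the string ends there.
void hello_reference(char *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```

With the arrays from the post, `hello_reference(a, b, N)` turns `a` into `World!`. A second `Hello ` in the output therefore suggests that the buffer came back from the device unmodified.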

Thanks

```
int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;  // device copy of a
    int *bd;   // device copy of b
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    // Report allocation failures
    if (cudaSuccess != cudaMalloc( (void**)&ad, csize ) )
        printf("AD / Error!\n");
    if (cudaSuccess != cudaMalloc( (void**)&bd, isize ))
        printf("BD / Error!\n");

    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);

    // Report a failed kernel launch
    if (cudaSuccess != cudaGetLastError())
        printf("Kernal - Error\n");

    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}
```

My output for this is: