Problem with a program running on CUDA

I am a beginner in CUDA programming and I have a problem. I am on Windows and installed VS2013 and CUDA 6.5.
I created two sample programs to run, but the results are always zero (0). The program works when executed on the CPU, but after adjusting it to use the GPU the results are always zero (0).
What’s the problem?

Code:

#include <stdio.h>

#define SIZE 1024

__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
int i = threadIdx.x;

if (i < n)
	c[i] = a[i] + b[i];

}

int main()
{
int *a,*b,*c;
int *d_a, *d_b, *d_c;

a = (int *)malloc(SIZE*sizeof(int));
b = (int *)malloc(SIZE*sizeof(int));
c = (int *)malloc(SIZE*sizeof(int));

cudaMalloc( &d_a, SIZE*sizeof(int));
cudaMalloc( &d_b, SIZE*sizeof(int));
cudaMalloc( &d_c, SIZE*sizeof(int));


for (int i = 0; i < SIZE; ++i)
{
	a[i] = i;
	b[i] = i;
	c[i] = 0;
}

cudaMemcpy(d_a, a, SIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, SIZE*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_c, c, SIZE*sizeof(int), cudaMemcpyHostToDevice);

VectorAdd<<<1,SIZE>>>(d_a, d_b, d_c, SIZE);

cudaMemcpy( c, d_c, SIZE*sizeof(int), cudaMemcpyDeviceToHost);

for (int i = 0; i < 10; ++i)
{
	printf("c[%d] - %d\n", i, c[i]);
}

free(a);
free(b);
free(c);

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

return 0;

}

RESULT:
c[0] - 0
c[1] - 0
c[2] - 0
c[3] - 0
c[4] - 0
c[5] - 0
c[6] - 0
c[7] - 0
c[8] - 0
c[9] - 0

THE EXPECTED RESULT:
RESULT:
c[0] - 0
c[1] - 2
c[2] - 4
c[3] - 6
c[4] - 8
c[5] - 10
c[6] - 12
c[7] - 14
c[8] - 16
c[9] - 18

Your code works correctly for me. You may have a machine setup issue, or a problem with how you are compiling.

Add proper cuda error checking to your code, then recompile and re-run it. The error output will be useful for discovering what the problem is.

If you don’t know what “proper cuda error checking” is, then google “proper cuda error checking” and take the first hit. Then apply that to your code.
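For reference, the commonly used pattern looks something like the sketch below (the names `gpuErrchk`/`gpuAssert` are the conventional ones from the top search hit, not anything specific to your project):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call with this macro; it prints the failing
// call's file and line and aborts on any error.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Usage:
//   gpuErrchk(cudaMalloc(&d_a, SIZE * sizeof(int)));
//   gpuErrchk(cudaMemcpy(d_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice));
```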

I added the cudasafe method and it does not report any error during execution. I also set the device properties in the project, but it is still not working. What else could I try?

#include <stdio.h>
#include <cuda_runtime.h>
#include <cuda.h>

#define SIZE 1024

void cudasafe(cudaError_t error, const char* message)
{
if (error != cudaSuccess) { fprintf(stderr, "ERROR: %s : %i\n", message, error); exit(-1); }
}

__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
int i = threadIdx.x;

if (i < n)
	c[i] = a[i] + b[i];

}

int main()
{
int *a, *b, *c;
int *d_a, *d_b, *d_c;

a = (int *)malloc(SIZE*sizeof(int));
b = (int *)malloc(SIZE*sizeof(int));
c = (int *)malloc(SIZE*sizeof(int));

cudasafe(cudaMalloc(&d_a, SIZE*sizeof(int)), "cudaMalloc");
cudasafe(cudaMalloc(&d_b, SIZE*sizeof(int)), "cudaMalloc");
cudasafe(cudaMalloc(&d_c, SIZE*sizeof(int)), "cudaMalloc");


for (int i = 0; i < SIZE; ++i)
{
	a[i] = i;
	b[i] = i;
	c[i] = 0;
}

cudasafe(cudaMemcpy(d_a, a, SIZE*sizeof(int), cudaMemcpyHostToDevice), "cudaMemcpy");
cudasafe(cudaMemcpy(d_b, b, SIZE*sizeof(int), cudaMemcpyHostToDevice), "cudaMemcpy");
cudasafe(cudaMemcpy(d_c, c, SIZE*sizeof(int), cudaMemcpyHostToDevice), "cudaMemcpy");

VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);

cudasafe(cudaMemcpy(c, d_c, SIZE*sizeof(int), cudaMemcpyDeviceToHost), "cudaMemcpy");

for (int i = 0; i < 10; ++i)
{
	printf("c[%d] - %d\n", i, c[i]);
}

free(a);
free(b);
free(c);

cudasafe(cudaFree(d_a), "cudaFree");
cudasafe(cudaFree(d_b), "cudaFree");
cudasafe(cudaFree(d_c), "cudaFree");

return 0;

}

I used GPUassert and presented the following error: GPUassert: invalid device function

Yes, you have a kernel launch error and did not implement the kernel error checking correctly (in what you have now posted).

The invalid device function error is a kernel launch error that can only be discovered if you implement the error checking correctly. This involves calling cudaGetLastError() or cudaPeekAtLastError() after the kernel launch and inspecting its return value, in addition to the other checks you already have.
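A minimal sketch of what that post-launch checking could look like in your code (plain inline checks rather than a macro, for clarity):

```cuda
VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);

// Catch launch errors such as "invalid device function";
// these are reported by the launch itself, not by later API calls.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "Launch error: %s\n", cudaGetErrorString(err));

// Catch errors that occur while the kernel is actually running.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "Execution error: %s\n", cudaGetErrorString(err));
```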

Anyway, invalid device function suggests that you are compiling for an incorrect GPU architecture.

How are you compiling this code, and what GPU do you have?

If you have created a new CUDA project in Visual Studio, check the “Device” settings under CUDA properties in the project. There should be strings like arch=compute_20,code=sm_20. These determine what GPU type you are compiling for.

It’s likely that you have a compute capability 1.x GPU but you did not properly specify the compute settings in Visual Studio, as cuda 6.5 without any switches will default to compiling for cc2.0 devices.
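For reference, the equivalent on the nvcc command line would be passing a -gencode flag matching your GPU; the cc1.1 target below is just an example, substitute your card’s actual compute capability:

```shell
# Compile for a compute capability 1.1 device (adjust to match your GPU).
# Without any -gencode flag, CUDA 6.5 defaults to compiling for cc2.0.
nvcc -gencode arch=compute_11,code=sm_11 vectoradd.cu -o vectoradd
```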

It’s likely that you have a problem similar to what is described here:

http://stackoverflow.com/questions/27320527/cuda-compilation-of-examples

In project properties… CUDA C/C++… Device, change it to “compute_11,sm_11” or the more recent “compute_12,sm_12” (that works on my 8400GS).

https://devtalk.nvidia.com/default/topic/788747/the-kernel-always-returns-values-equal-to-zero/