Program compiles cleanly but produces unexpected output

Hi,

The following is a sample CUDA program. It has two arrays (a and b) of 10 elements each. The elements are summed and stored in a third array (c).

#include<stdio.h>
#include<cuda.h>

#define N 10

__global__ void add(int *ca, int *cb, int *cc)
{
int i;

for(i=0; i<N; i++)
cc[i] = ca[i] + cb[i];

}

int main()
{

int a[N], b[N], c[N], i, *dev_a, *dev_b, *dev_c;

cudaMalloc(&dev_a, N*sizeof(int));
cudaMalloc(&dev_b, N*sizeof(int));
cudaMalloc(&dev_c, N*sizeof(int));

for (i=0; i<N; i++)
{
a[i]=b[i]=i;
}

printf("\n N = %d \n", N);
printf("a[5] = %d\n b[5] = %d ", a[5], b[5] );

cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice);

add <<< N , 1 >>> ( dev_a, dev_b, dev_c );

cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

printf("\n THE CONTENTS ARE AS FOLLOWS: \n");

for(i=0; i<N; i++)
printf("\t c[%d] = %d ",i, c[i]);

printf("\n\n");

return 0;

}

Compilation is smooth:

nvcc testarray.cu -o testarray

Output is incorrect:

time ./testarray

N = 10
a[5] = 5
b[5] = 5
THE CONTENTS ARE AS FOLLOWS:
c[0] = 612178064 c[1] = 32767 c[2] = 6308204 c[3] = 0 c[4] = 612178424 c[5] = 32767 c[6] = 612178408 c[7] = 32767 c[8] = 1 c[9] = 0

real 0m30.721s
user 0m0.051s
sys 0m30.318s

What is going wrong here? And why does this small program take 30 seconds to run?

Thank you

The random results and the long delay suggest that one of the CUDA API calls is failing. In general, it is good practice to check the status of every CUDA API call. You may also want to review your kernel code: you appear to be launching 10 thread blocks of one thread each, and each of the 10 threads updates all 10 result elements. I assume your intention was to have each thread update one element.
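As a rough illustration of both suggestions, here is a minimal sketch of the same program with every runtime call checked and with each block handling a single element. CUDA_CHECK is a hypothetical helper macro made up for this example, not part of the CUDA API, and exiting on the first error is just one possible policy.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10

// Hypothetical helper (not part of CUDA): report the failing call with
// its error string and stop, instead of silently continuing.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s failed: %s\n",               \
                    __FILE__, __LINE__, #call,                      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Each block computes exactly one element, selected by its block index.
__global__ void add(const int *ca, const int *cb, int *cc)
{
    int i = blockIdx.x;
    if (i < N)
        cc[i] = ca[i] + cb[i];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    for (int i = 0; i < N; i++)
        a[i] = b[i] = i;

    CUDA_CHECK(cudaMalloc((void **)&dev_a, N * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&dev_b, N * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&dev_c, N * sizeof(int)));

    CUDA_CHECK(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));

    // N blocks of one thread each, as in the original launch.
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
    CUDA_CHECK(cudaGetLastError());  // catches launch failures

    // The blocking copy also reports errors raised while the kernel ran.
    CUDA_CHECK(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d\n", i, c[i]);

    CUDA_CHECK(cudaFree(dev_a));
    CUDA_CHECK(cudaFree(dev_b));
    CUDA_CHECK(cudaFree(dev_c));
    return 0;
}
```

With a check like this, the first failing call is reported along with the text from cudaGetErrorString, rather than the program carrying on and printing uninitialized memory.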

I'll find out whether any API calls are failing.

But what could cause them to fail?

That should not be the problem, because I have tried it with 1 thread per block.

I’ve modified the code by including the CUDA error checking functions.

#define N 10

__global__ void add(int *ca, int *cb, int *cc)
{
int i;

for(i=0; i<N; i++)
cc[i] = ca[i] + cb[i];

}

int main()
{

int a[N], b[N], c[N], i, *dev_a, *dev_b, *dev_c;

if (cudaSuccess != cudaMalloc(&dev_a, N*sizeof(int)))
printf("\n 1st CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc(&dev_b, N*sizeof(int)))
printf("\n 2nd CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc(&dev_c, N*sizeof(int)))
printf("\n 3rd CUDA Malloc Fail");

for (i=0; i<N; i++)
{
a[i]=b[i]=i;
}

printf("\n N = %d \n", N);
printf("a[5] = %d\n b[5] = %d ", a[5], b[5] );

cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice);

add <<< 1 , 1 >>> ( dev_a, dev_b, dev_c );

cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

printf("\n THE CONTENTS ARE AS FOLLOWS: \n");

for(i=0; i<N; i++)
printf("\t c[%d] = %d ",i, c[i]);

printf("\n\n");

return 0;

}

time ./testarray

1st CUDA Malloc Fail

2nd CUDA Malloc Fail

3rd CUDA Malloc Fail

N = 10

a[5] = 5

b[5] = 5

THE CONTENTS ARE AS FOLLOWS:

 c[0] =  180628368 	 c[1] =  32767 	 c[2] =  6308332 	 c[3] =  0 	 c[4] =  180628728 	 c[5] =  32767 	 c[6] =  180628712 	 c[7] = 32767 	   c[8] =  1 	 c[9] =  0 

real 0m31.251s

user 0m0.056s

sys 0m30.775s

The software and hardware environment is as follows:

DEVICE INFORMATION:

Device 0: “Tesla C2050”

Type of device: GPU

Compute capability: 2

Double precision support: Yes

Total amount of global memory: 2.62421 GB

Number of compute units/multiprocessors: 14

Number of cores: 448

Total amount of constant memory: 65536 bytes

Total amount of local/shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum group size (# of threads per block) 1024 x 1024 x 64

Maximum item sizes (# threads for each dim) 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Clock rate: 1.147 GHz

Concurrent copy and execution: Yes

GCC compiler version:

gcc --version

gcc (GCC) 4.5.3

NVCC/CUDA Toolkit

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Wed_Sep__8_17:12:45_PDT_2010

Cuda compilation tools, release 3.2, V0.2.1221

Operating System:

Red Hat Enterprise Linux Server release 5.4 Beta (Tikanga)

This program previously worked and produced correct results.

But now it is failing.

Can someone help me understand why these CUDA functions are failing?

Thanks

All your cudaMalloc calls should be:
cudaMalloc((void**)&dev_a, N*sizeof(int))
not
cudaMalloc(&dev_a, N*sizeof(int))

I modified the code as you suggested:

if (cudaSuccess != cudaMalloc((void**)&dev_a, N*sizeof(int)))
printf("\n 1st CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc((void**)&dev_b, N*sizeof(int)))
printf("\n 2nd CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc((void**)&dev_c, N*sizeof(int)))
printf("\n 3rd CUDA Malloc Fail");

But the same errors persist. How can I debug this further?
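One possible next step is to print the actual error text instead of a fixed "Malloc Fail" message, and to ask the runtime what it can see before allocating anything. The sketch below assumes nothing about the cause. It only reports what cudaDriverGetVersion, cudaRuntimeGetVersion, cudaGetDeviceCount and a one-element cudaMalloc return on the machine in question. The variable names are made up for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;
    int *dev_p = NULL;
    cudaError_t err;

    // Version supported by the installed driver (stays 0 if none is found).
    err = cudaDriverGetVersion(&driverVersion);
    printf("cudaDriverGetVersion:  %s (version %d)\n",
           cudaGetErrorString(err), driverVersion);

    // Version of the runtime library this binary is linked against.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    printf("cudaRuntimeGetVersion: %s (version %d)\n",
           cudaGetErrorString(err), runtimeVersion);

    // Number of CUDA devices the runtime can see.
    err = cudaGetDeviceCount(&deviceCount);
    printf("cudaGetDeviceCount:    %s (%d device(s))\n",
           cudaGetErrorString(err), deviceCount);

    // A tiny allocation, reported with the runtime's own error string.
    err = cudaMalloc((void **)&dev_p, sizeof(int));
    printf("cudaMalloc:            %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess)
        cudaFree(dev_p);

    return err == cudaSuccess ? 0 : 1;
}
```

The strings printed here usually narrow things down more than the return code alone, since they show whether the failure is specific to the allocation or is already present when the runtime first touches the device.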