Program gives unexpected error compiles smooth, but output is unexpected result

sanf · October 15, 2011, 4:42am

Hi,

The following is a sample CUDA program. It has two arrays (a and b) of 10 elements each. Corresponding elements are summed and stored in a third array (c).

#include<stdio.h>
#include<cuda.h>

#define N 10

__global__ void add(int *ca, int *cb, int *cc)
{
int i;

for(i=0; i<N; i++)
cc[i] = ca[i] + cb[i];

}

int main()
{

int a[N], b[N], c[N], i, *dev_a, *dev_b, *dev_c;

cudaMalloc(&dev_a, N*sizeof(int));
cudaMalloc(&dev_b, N*sizeof(int));
cudaMalloc(&dev_c, N*sizeof(int));

for (i=0; i<N; i++)
{
a[i]=b[i]=i;
}

printf("\n N = %d \n", N);
printf("a[5] = %d\n b[5] = %d ", a[5], b[5] );

cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice);

add <<< N , 1 >>> ( dev_a, dev_b, dev_c );

cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);

cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);

printf("\n THE CONTENTS ARE AS FOLLOWS: \n");

for(i=0; i<N; i++)
printf("\t c[%d] = %d ",i, c[i]);

printf("\n\n");

return 0;

}

Compilation is smooth:

nvcc testarray.cu -o testarray

Output is incorrect:

time ./testarray

N = 10
a[5] = 5
b[5] = 5
THE CONTENTS ARE AS FOLLOWS:
c[0] = 612178064 c[1] = 32767 c[2] = 6308204 c[3] = 0 c[4] = 612178424 c[5] = 32767 c[6] = 612178408 c[7] = 32767 c[8] = 1 c[9] = 0

real 0m30.721s
user 0m0.051s
sys 0m30.318s

What is going wrong here? And why does this small program take 30 seconds to execute?

Thank you

njuffa · October 15, 2011, 5:29am

The random results and the lengthy delay would appear to indicate that one of the CUDA API calls is failing. In general it is good practice to check the status of every single CUDA API call. You may also want to review your kernel code: it seems you are launching 10 thread blocks with one thread each, and each of the 10 threads is trying to update each of the 10 result elements. I assume your intention was to have each thread update one of the elements.
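As a rough sketch of both suggestions (not the original poster's code): the `CHECK` macro below is a hypothetical helper name introduced here for illustration, while `cudaGetErrorString` and `cudaGetLastError` are standard runtime calls. The kernel takes an extra `n` parameter, which is also an addition, and uses `blockIdx.x` so that each of the 10 single-thread blocks updates exactly one element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10

/* Hypothetical helper macro: report the failing call and stop. */
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s failed: %s\n",            \
                    __FILE__, __LINE__, #call,                   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

/* One element per block: block i computes cc[i] only. */
__global__ void add(const int *ca, const int *cb, int *cc, int n)
{
    int i = blockIdx.x;
    if (i < n)
        cc[i] = ca[i] + cb[i];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;

    for (int i = 0; i < N; i++)
        a[i] = b[i] = i;

    CHECK(cudaMalloc((void **)&dev_a, N * sizeof(int)));
    CHECK(cudaMalloc((void **)&dev_b, N * sizeof(int)));
    CHECK(cudaMalloc((void **)&dev_c, N * sizeof(int)));

    CHECK(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));

    add<<<N, 1>>>(dev_a, dev_b, dev_c, N);
    CHECK(cudaGetLastError());  /* catches launch-time errors */

    /* The blocking copy also surfaces errors from kernel execution. */
    CHECK(cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost));

    CHECK(cudaFree(dev_a));
    CHECK(cudaFree(dev_b));
    CHECK(cudaFree(dev_c));

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}
```

With this pattern the first failing call is named together with the runtime's error text, instead of the program carrying on and printing uninitialized host memory.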

sanf · October 15, 2011, 6:09am

I'll find out whether any API calls are failing.

But what could be the reasons for the failure?

The launch configuration should not be the problem, because I have also tried it using 1 thread per block.

sanf · October 17, 2011, 10:15am

I’ve modified the code by including the CUDA error checking functions.

#define N 10

__global__ void add(int *ca, int *cb, int *cc)
{
    int i;

    for(i=0; i<N; i++)
        cc[i] = ca[i] + cb[i];
}

int main()
{
    int a[N], b[N], c[N], i, *dev_a, *dev_b, *dev_c;

    if (cudaSuccess != cudaMalloc(&dev_a, N*sizeof(int)))
        printf("\n 1st CUDA Malloc Fail");

    if (cudaSuccess != cudaMalloc(&dev_b, N*sizeof(int)))
        printf("\n 2nd CUDA Malloc Fail");

    if (cudaSuccess != cudaMalloc(&dev_c, N*sizeof(int)))
        printf("\n 3rd CUDA Malloc Fail");

    for (i=0; i<N; i++)
    {
        a[i]=b[i]=i;
    }

    printf("\n N = %d \n", N);
    printf("a[5] = %d\n b[5] = %d ", a[5], b[5] );

    cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice);

    add <<< 1 , 1 >>> ( dev_a, dev_b, dev_c );

    cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    printf("\n THE CONTENTS ARE AS FOLLOWS: \n");

    for(i=0; i<N; i++)
        printf("\t c[%d] = %d ",i, c[i]);

    printf("\n\n");

    return 0;
}

time ./testarray

1st CUDA Malloc Fail
2nd CUDA Malloc Fail
3rd CUDA Malloc Fail
N = 10
a[5] = 5
b[5] = 5
THE CONTENTS ARE AS FOLLOWS:
c[0] = 180628368 c[1] = 32767 c[2] = 6308332 c[3] = 0 c[4] = 180628728 c[5] = 32767 c[6] = 180628712 c[7] = 32767 c[8] = 1 c[9] = 0

real 0m31.251s
user 0m0.056s
sys 0m30.775s

The s/w & h/w environment is as follows:

DEVICE INFORMATION:

Device 0: "Tesla C2050"
Type of device: GPU
Compute capability: 2
Double precision support: Yes
Total amount of global memory: 2.62421 GB
Number of compute units/multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.147 GHz
Concurrent copy and execution: Yes

GCC compiler version:

gcc --version
gcc (GCC) 4.5.3

NVCC/CUDA Toolkit

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Built on Wed_Sep__8_17:12:45_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221

Operating System:

Red Hat Enterprise Linux Server release 5.4 Beta (Tikanga)

This program previously worked and produced correct results, but now it is failing.

Can someone help me understand why these CUDA functions are failing?

Thanks

mfatica · October 17, 2011, 11:03am

All your cudaMalloc calls should be:
cudaMalloc((void**)&dev_a, N*sizeof(int))
not
cudaMalloc(&dev_a, N*sizeof(int))

sanf · October 17, 2011, 11:37am

I modified the code as you suggested:

if (cudaSuccess != cudaMalloc((void**)&dev_a, N*sizeof(int)))
    printf("\n 1st CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc((void**)&dev_b, N*sizeof(int)))
    printf("\n 2nd CUDA Malloc Fail");

if (cudaSuccess != cudaMalloc((void**)&dev_c, N*sizeof(int)))
    printf("\n 3rd CUDA Malloc Fail");

But the same errors still occur. How can I debug this further?
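One possible next step, in line with the earlier advice to check the status of every call, is to print the runtime's own error text rather than a fixed "Malloc Fail" message, and to query the driver and device before allocating anything. The sketch below is only an assumption about what might be useful here; `report` is a hypothetical helper name, while `cudaDriverGetVersion`, `cudaRuntimeGetVersion`, `cudaGetDeviceCount` and `cudaGetErrorString` are standard runtime calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper: print a label plus the runtime's error text.
   Returns 1 if the call succeeded, 0 otherwise. */
static int report(const char *what, cudaError_t err)
{
    printf("%-24s: %s\n", what, cudaGetErrorString(err));
    return err == cudaSuccess;
}

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;
    int *dev_a = NULL;
    int ok = 1;

    /* A driver older than the runtime is one possible cause of failures. */
    ok &= report("cudaDriverGetVersion", cudaDriverGetVersion(&driverVersion));
    ok &= report("cudaRuntimeGetVersion", cudaRuntimeGetVersion(&runtimeVersion));
    printf("driver %d, runtime %d\n", driverVersion, runtimeVersion);

    /* Confirms whether the runtime can see the GPU at all. */
    ok &= report("cudaGetDeviceCount", cudaGetDeviceCount(&deviceCount));
    printf("devices found: %d\n", deviceCount);

    /* Same allocation as in the thread, but with the actual reason printed. */
    cudaError_t err = cudaMalloc((void **)&dev_a, 10 * sizeof(int));
    ok &= report("cudaMalloc", err);
    if (err == cudaSuccess)
        cudaFree(dev_a);

    return ok ? 0 : 1;
}
```

The text printed next to `cudaMalloc` should narrow down whether the failure comes from the allocation itself or from initialization of the device, which happens on the first runtime call.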
