CUDA kernels giving bad results

I am a CUDA beginner who has successfully compiled and run several code samples using CUDA libraries such as CUFFT and CUBLAS. Lately, however, I have been trying to write my own simple kernels and keep getting nonsense values back. That is, when I pass a parameter into a kernel, set its value in the kernel, and then copy the results back to the host, the values I read are bogus. I have tried many simple tutorial kernels that seem to work for most people online, but I always get nonsensical values. For example…

#define SIZE 10

// Kernel definition, see also section 4.2.3 of Nvidia Cuda Programming Guide
__global__ void vecAdd(float* A, float* B, float* C) {
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

int main () {
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are called:  Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);

    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}

This is a fairly straightforward problem; the results should be:

C[0]=0.000000

C[1]=1.000000 

C[2]=2.000000 

C[3]=3.000000 

C[4]=4.000000 

C[5]=5.000000 

C[6]=6.000000 

C[7]=7.000000 

C[8]=8.000000 

C[9]=9.000000 

However, my results are always random and generally look more like:

C[0]=nan

C[1]=-32813464158208.000000

C[2]=nan

C[3]=-27667211200843743232.000000

C[4]=34559834084263395806523272811251761152.000000

C[5]=9214363188332593152.000000

C[6]=nan

C[7]=-10371202300694685655937271382147072.000000

C[8]=121653576586393934243511643668480.000000

C[9]=-30648783863808.000000

So basically, when I pass parameters into a CUDA kernel with the intention of storing results in them to copy back to the host, I get junk out.

Any help would be greatly appreciated.

Thanks.

Not likely to be the source of the problem, but

cudaFree(devPtrA);
cudaFree(devPtrA);
cudaFree(devPtrA);

doesn’t seem that useful…

Your code is correct, aside from a missing stdio and () for main.

It seems like a configuration problem.

#include <stdio.h>

#define SIZE 10

// Kernel definition, see also section 4.2.3 of Nvidia Cuda Programming Guide
__global__ void vecAdd(float* A, float* B, float* C) {
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}

int main() {
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);

    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
$ nvcc -o simpleadd simpleadd.cu

$ ./simpleadd 

C[0]=0.000000

C[1]=1.000000

C[2]=2.000000

C[3]=3.000000

C[4]=4.000000

C[5]=5.000000

C[6]=6.000000

C[7]=7.000000

C[8]=8.000000

C[9]=9.000000

Check errors from API calls, lest you face code that should work but doesn’t.

Good catch. This was taken from an example online and I hadn’t noticed.

Thank you for the reply. I’ve heard of “CUDA_SAFE_CALL”. Is this the best way to check errors from API calls?

No, CUDA_SAFE_CALL is just a macro defined for the purposes of the NVIDIA SDK to make example code shorter. You should not rely on it in your code.

What you should do is check the return values of the CUDA functions you call. Most CUDA runtime functions return an error code (see the reference documentation); if that code is not cudaSuccess (zero), you have an error condition described by the value.
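For instance, using the variable names from the program above (a minimal sketch of the pattern; it assumes the standard CUDA runtime API and needs a CUDA-capable machine to actually run):

```cpp
// Check the value the runtime call returns
cudaError_t err = cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the code to a human-readable message
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    return 1;
}

// Kernel launches return nothing, so check them with cudaGetLastError()
vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "vecAdd launch failed: %s\n", cudaGetErrorString(err));
```

The kernel-launch check is easy to forget, since the `<<< >>>` syntax gives you no return value to inspect.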

One might argue that it’s a good idea to have your own copy of CUDA_SAFE_CALL in your own header files. Wrapping all CUDA calls in that is much better than doing no error checking (as we see regularly on the forums), and abort-on-error is perfectly OK for a lot of people. The amount of typing you have to do otherwise very rapidly gets tedious. When you want to do something more sophisticated (such as throwing an exception)… just update the macro.