Unexpected output from a basic program

Hi everybody,

I’m totally new to CUDA; in fact, this is my very first try at running some code.
I got stuck on what is probably a silly problem and I can’t seem to find a solution.

So, the program in question is shown in CUDACast Episode #2 (http://devblogs.nvidia.com/parallelforall/cudacasts-episode-2-your-first-cuda-c-program/). It should simply define a kernel called VectorAdd, which adds two vectors, a and b, in parallel and stores the result in vector c.

You can find the code here (https://github.com/parallel-forall/cudacasts/blob/master/ep2-first-cuda-c-program/kernel.cu), but I will paste it here, too:

#include <stdio.h>

#define SIZE	1024

__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
	int i = threadIdx.x;

	if (i < n)
		c[i] = a[i] + b[i];
}

int main()
{
	int *a, *b, *c;
	int *d_a, *d_b, *d_c;

	a = (int *)malloc(SIZE*sizeof(int));
	b = (int *)malloc(SIZE*sizeof(int));
	c = (int *)malloc(SIZE*sizeof(int));

	cudaMalloc( &d_a, SIZE*sizeof(int));
	cudaMalloc( &d_b, SIZE*sizeof(int));
	cudaMalloc( &d_c, SIZE*sizeof(int));

	for( int i = 0; i < SIZE; ++i )
	{
		a[i] = i;
		b[i] = i;
		c[i] = 0;
	}

	cudaMemcpy( d_a, a, SIZE*sizeof(int), cudaMemcpyHostToDevice );
	cudaMemcpy( d_b, b, SIZE*sizeof(int), cudaMemcpyHostToDevice );
	cudaMemcpy( d_c, c, SIZE*sizeof(int), cudaMemcpyHostToDevice );

	VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
	
	cudaMemcpy( c, d_c, SIZE*sizeof(int), cudaMemcpyDeviceToHost );

	for( int i = 0; i < 10; ++i)
		printf("c[%d] = %d\n", i, c[i]);

	free(a);
	free(b);
	free(c);

	cudaFree(d_a);
	cudaFree(d_b);
	cudaFree(d_c);

	return 0;
}

The output should be:

c[0] = 0
c[1] = 2
c[2] = 4
c[3] = 6
c[4] = 8
c[5] = 10
c[6] = 12
c[7] = 14
c[8] = 16
c[9] = 18

I obtain this, instead:

c[0] = 0
c[1] = 0
c[2] = 0
c[3] = 0
c[4] = 0
c[5] = 0
c[6] = 0
c[7] = 0
c[8] = 0
c[9] = 0

I installed everything properly and passed both tests the “CUDA Getting Started” guide recommends running to verify the installation. Here’s the output of deviceQuery.exe:

deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 9600M GT"
  CUDA Driver Version / Runtime Version          6.5 / 6.5
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 512 MBytes (536870912 bytes)
  ( 4) Multiprocessors, (  8) CUDA Cores/MP:     32 CUDA Cores
  GPU Clock rate:                                1250 MHz (1.25 GHz)
  Memory Clock rate:                             800 Mhz
  Memory Bus Width:                              128-bit
  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  768
  Maximum number of threads per block:           512
  Max dimension size of a thread block (x,y,z): (512, 512, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce 9600M GT
Result = PASS

And of bandwidthTest.exe:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce 9600M GT
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2283.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1661.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     13514.5

Result = PASS

As you can see, my hardware is a GeForce 9600M GT. The installer automatically updated the graphics driver to version 340.62, so that should be the appropriate version. I’m using CUDA Toolkit 6.5 and Visual Studio 2010, if that matters.

Other than the tests mentioned above, I have compiled and run the sample projects that come with a standard installation of the CUDA Toolkit with no problems.

I tried lowering the value of SIZE, but the error is there no matter what.
I really don’t know what to do. Any help is greatly appreciated.

You should add error checking after every CUDA function call and kernel launch; see http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api for example.
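A minimal sketch of what that looks like, along the lines of the linked answer (the gpuErrchk/gpuAssert names come from that answer; adapt to taste):

```
#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call and check after kernel launches.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Usage:
//   gpuErrchk(cudaMalloc(&d_a, SIZE*sizeof(int)));
//   VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
//   gpuErrchk(cudaGetLastError());        // catches launch errors
//   gpuErrchk(cudaDeviceSynchronize());   // catches errors during execution
```

With this in place, an invalid launch configuration shows up immediately instead of silently leaving c full of zeros.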

My guess is that you compiled for the wrong compute capability. The default in CUDA 6.5 is cc 2.0; for your GPU it should be 1.1. Try compiling with nvcc -arch=sm_11 kernel.cu

Thank you for your help; unfortunately, changing the compute capability did not solve the problem. The output is still the same.

I did as shown in the accepted answer here (http://stackoverflow.com/questions/14411435/how-to-set-cuda-compiler-flags-in-visual-studio-2010), setting compute_11,sm_11. The compiler did pick it up and showed a warning about the future deprecation of compute capability 1.x. Still, no luck.

I’ll try adding error checking, but I have no clue what the problem may be.

#define SIZE 1024

Maximum number of threads per block: 512

Use something like cudaGetLastError() to see if the kernel even launches.

You probably also need a cudaDeviceSynchronize() between the kernel launch and

cudaMemcpy( c, d_c, SIZE*sizeof(int), cudaMemcpyDeviceToHost );

for( int i = 0; i < 10; ++i)
	printf("c[%d] = %d\n", i, c[i]);
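By the way, you don’t have to shrink SIZE to fit the 512-threads-per-block limit. The usual pattern is to launch several blocks and compute a global index in the kernel; a sketch (untested on your setup, but standard):

```
__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
	// global thread index across all blocks
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i < n)
		c[i] = a[i] + b[i];
}

// Launch with 256 threads per block, safely under the 512 limit on cc 1.1.
// For SIZE = 1024 this gives 4 blocks; the rounding-up handles any SIZE.
int threads = 256;
int blocks  = (SIZE + threads - 1) / threads;
VectorAdd<<< blocks, threads >>>(d_a, d_b, d_c, SIZE);
```

The if (i < n) guard already in the kernel takes care of any extra threads in the last block.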

Thank you little_jimmy, your advice combined with that of hadschi118 did the trick!

As I said, I had already tried lowering the SIZE value, but seeing no results I reset it back to 1024. I should have read that deviceQuery log more carefully and noticed the hardware limitation.

Now, with the compute capability properly set, changing the SIZE value to 512 finally made the program work.

Thank you both for taking the time to help me; I realize this was a very noob/dumb problem, and not everyone would have bothered.

I should’ve F5’d before replying :)