Hi everybody,
I’m totally new to CUDA; in fact, this is my very first attempt at running some code.
I got stuck on what is probably a silly problem, and I can’t seem to find a solution.
The program in question is shown in CUDACast Episode #2 (http://devblogs.nvidia.com/parallelforall/cudacasts-episode-2-your-first-cuda-c-program/). It simply creates a kernel called VectorAdd, which adds two vectors, a and b, in parallel and stores the result in vector c.
You can find the code here (cudacasts/kernel.cu at master · NVIDIA-developer-blog/cudacasts · GitHub), but I will paste it here, too:
#include <stdio.h>

#define SIZE 1024

// Kernel: each thread adds one pair of elements.
__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
    int i = threadIdx.x;

    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    int *a, *b, *c;        // host arrays
    int *d_a, *d_b, *d_c;  // device arrays

    // Allocate host memory.
    a = (int *)malloc(SIZE * sizeof(int));
    b = (int *)malloc(SIZE * sizeof(int));
    c = (int *)malloc(SIZE * sizeof(int));

    // Allocate device memory.
    cudaMalloc(&d_a, SIZE * sizeof(int));
    cudaMalloc(&d_b, SIZE * sizeof(int));
    cudaMalloc(&d_c, SIZE * sizeof(int));

    // Initialize the input vectors.
    for (int i = 0; i < SIZE; ++i)
    {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    // Copy the inputs to the device.
    cudaMemcpy(d_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, c, SIZE * sizeof(int), cudaMemcpyHostToDevice);

    // Launch one block of SIZE threads.
    VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);

    // Copy the result back to the host.
    cudaMemcpy(c, d_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the first ten results.
    for (int i = 0; i < 10; ++i)
        printf("c[%d] = %d\n", i, c[i]);

    // Clean up.
    free(a);
    free(b);
    free(c);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
The output should be:
c[0] = 0
c[1] = 2
c[2] = 4
c[3] = 6
c[4] = 8
c[5] = 10
c[6] = 12
c[7] = 14
c[8] = 16
c[9] = 18
Instead, I obtain this:
c[0] = 0
c[1] = 0
c[2] = 0
c[3] = 0
c[4] = 0
c[5] = 0
c[6] = 0
c[7] = 0
c[8] = 0
c[9] = 0
I installed everything properly and passed both tests that the “CUDA Getting Started” guide recommends running to verify the installation. Here’s the output of deviceQuery.exe:
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 9600M GT"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 4) Multiprocessors, ( 8) CUDA Cores/MP: 32 CUDA Cores
GPU Clock rate: 1250 MHz (1.25 GHz)
Memory Clock rate: 800 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = GeForce 9600M GT
Result = PASS
And here’s the output of bandwidthTest.exe:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce 9600M GT
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2283.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1661.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 13514.5
Result = PASS
As you can see, my hardware is a GeForce 9600M GT. The installer automatically updated the graphics driver to version 340.62, so that should be the appropriate version. I’m using CUDA Toolkit 6.5 and Visual Studio 2010, if that matters.
Other than the tests mentioned above, I have compiled and run the sample projects that come with a standard installation of the CUDA Toolkit with no problems.
I tried lowering the value of SIZE, but the error is there no matter what.
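In case it helps with debugging, I was also thinking of wrapping the CUDA calls and the kernel launch in some basic error checking, roughly like the sketch below. This is just my own guess from reading the documentation (the CHECK_CUDA macro is my own helper), and I haven’t actually tried it yet:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// My own (untested) helper macro: print the error and bail out if a CUDA call fails.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error at %s:%d: %s\n",                       \
                   __FILE__, __LINE__, cudaGetErrorString(err));      \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// ... then inside main(), for example:
CHECK_CUDA(cudaMalloc(&d_a, SIZE * sizeof(int)));
CHECK_CUDA(cudaMemcpy(d_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice));

VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);
CHECK_CUDA(cudaGetLastError());       // did the launch itself fail?
CHECK_CUDA(cudaDeviceSynchronize());  // did the kernel run to completion?

CHECK_CUDA(cudaMemcpy(c, d_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost));

Would adding something like this be the right way to find out whether the kernel launch or one of the copies is actually failing?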
I really don’t know what to do. Any help is greatly appreciated.