Getting started with CUDA ... cannot add simple vectors

hello everybody,

i am totally new to CUDA and i wanted to do some simple vector addition but somehow i always get zero answers.
can anybody help me out?
the code is really easy and i have absolutely no idea what could possibly be wrong.

hs@quad:/data/projects/crealiity/cuda$ cat main.cu
#include <stdio.h>
#include <cuda.h>

#define VAR 8
#define N 8

__global__ void matAdd(float *A, float *B, float *C)
{
int i = threadIdx.x;

C[i] = A[i] + B[i];

}

int
main()
{
float *A_h;
float *B_h;
float *C_h;

float 	*A_d;
float 	*B_d;
float 	*C_d;

A_h = (float *) (malloc(sizeof(float) * VAR));
B_h = (float *) (malloc(sizeof(float) * VAR));
C_h = (float *) (malloc(sizeof(float) * VAR));

cudaMalloc( (void **) &A_d, sizeof(float) * VAR);
cudaMalloc( (void **) &B_d, sizeof(float) * VAR);
cudaMalloc( (void **) &C_d, sizeof(float) * VAR);

printf("RAM allocated ...\n");

for	(int i = 0; i < VAR; i++)
{
	A_h[i] = 2.0f;
	B_h[i] = 2.0f;
}	

printf("calling kernel ...\n");
cudaMemcpy(A_d, A_h, sizeof(float) * VAR, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, sizeof(float) * VAR, cudaMemcpyHostToDevice);

// Kernel invocation 
dim3 dimBlock(1, 4); 

/* copy data to GPU */
printf("copy data ...\n");


matAdd<<<1, dimBlock>>>(A_d, B_d, C_d); 
printf("addition done ...\n");

/* copy answer back and display */
cudaMemcpy(C_h, C_d, sizeof(float) * VAR, cudaMemcpyDeviceToHost);

for	(int i = 0; i < VAR; i++)
{
	printf("line: %f\n", C_h[i]);
}

printf("fixed ...\n");
return 0;

}

i am filling two vectors, copying them to the GPU and adding them up.
then i copy the answer back to the CPU.
no core dumps, no compiler warnings - just garbage answers.

hs@quad:/data/projects/crealiity/cuda$ make
PATH=/usr/local/cuda/bin:/usr/local/cuda/bin/:/home/hs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
nvcc main.cu -o prog

the program runs nicely …

hs@quad:/data/projects/crealiity/cuda$ ./prog
RAM allocated …
calling kernel …
copy data …
addition done …
line: 0.000000
line: 0.000000
line: 0.000000
line: 0.000000
line: 0.000000
line: 0.000000
line: 0.000000
line: 0.000000
fixed …

i have a CUDA-compliant card.

hs@quad:/data/projects/crealiity/cuda$ su root
Password:
root@quad:/data/projects/crealiity/cuda# nvclock -i
– General info –
Card: nVidia Geforce 8600GT
Architecture: G84 A2
PCI id: 0x402
GPU clock: 540.000 MHz
Bustype: PCI-Express

– Shader info –
Clock: 1188.000 MHz
Stream units: 32 (11b)
ROP units: 8 (11b)
– Memory info –
Amount: 256 MB
Type: 128 bit DDR3
Clock: 702.000 MHz

– PCI-Express info –
Current Rate: 16X
Maximum rate: 16X

– Sensor info –
Sensor: GPU Internal Sensor
GPU temperature: 80C

– VideoBios information –
Version: 60.84.35.00.00
Signon message: GeForce 8600 GT VGA BIOS
Performance level 0: gpu 540MHz/shader 1188MHz/memory 700MHz/0.00V/100%
VID mask: 3
Voltage level 0: 1.20V, VID: 1
Voltage level 1: 1.32V, VID: 3

can anybody tell me what i am doing wrong?

many thanks,

hans

After a quick read-through of your code, the only thing amiss I see is that your grid/thread configuration is incorrect. You are launching 1 block that is 1x4, but indexing by threadIdx.x in your kernel.

However, that still runs 4 threads, all with threadIdx.x = 0, so I’m not sure why the 0’th element is not correct. You are also printing 8 elements, but even with 4 threads along x the kernel would only write 4 of them.
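For example, something like this (an untested sketch, assuming you want one thread per element of your 8-element vectors) should write all of C:

dim3 dimBlock(VAR, 1);   // 8 threads along x instead of a 1x4 block
matAdd<<<1, dimBlock>>>(A_d, B_d, C_d);
cudaThreadSynchronize(); // make sure the kernel has finished before copying C back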

Hello …

Thank you for your reply.

I experimented with the numbers but it makes no real difference.

look at this - this is scary:

zoltan@quad:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery

There is 1 device supporting CUDA

Device 0: “GeForce 8600 GT”

Major revision number: 1

Minor revision number: 1

Total amount of global memory: 267714560 bytes

Number of multiprocessors: 16

Number of cores: 128

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.19 GHz

Concurrent copy and execution: Yes

Test PASSED

Press ENTER to exit…

zoltan@quad:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./histogram64 --help

Using device 0: GeForce 8600 GT

Initializing data…

…allocating CPU memory.

…generating input data

…allocating GPU memory and copying input data

Running GPU histogram (1 iterations)…

histogram64GPU() time (average) : 0.090000 msec //1059638.087234 MB/sec

Comparing the results…

…histogramCPU()

histogram64CPU() time : 110.294998 msec //864.657811 MB/sec

Total sum of histogram elements: -1464298936

Sum of absolute differences: -1415197770

TEST FAILED

Shutting down…

Press ENTER to exit…

deviceQuery passes but the histogram test fails. my card should be able to handle cuda, but for some reason histogram64 does not work.

what could be the reason?

is this a known issue?

hans

Seems strange to me. I ran your code (I added free and cudaFree calls at the end of it, and also zeroed the C_d array using cudaMemset).

I got the expected result of only the first element in the output being set to 4 (this was already pointed out in a previous reply - block dimensions…).

So other than the two issues I listed in the brackets, I had no problems with it.
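For reference, the additions were roughly these (a sketch from memory, not the exact lines I used):

// before the kernel launch: clear the device output buffer
cudaMemset(C_d, 0, sizeof(float) * VAR);

// at the end of main(): release device and host memory
cudaFree(A_d);
cudaFree(B_d);
cudaFree(C_d);
free(A_h);
free(B_h);
free(C_h);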

The odd thing here is that we seem to have the same card (maybe not the same vendor, but the same chip), yet mine gives a totally different output

on the first test, and passes the 2nd one:

$ ~/NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery
There is 1 device supporting CUDA

Device 0: "GeForce 8600 GT"

Major revision number: 1

Minor revision number: 1

Total amount of global memory: 536150016 bytes

Number of multiprocessors: 4

Number of cores: 32

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.19 GHz

Concurrent copy and execution: Yes

Test PASSED

$ ~/NVIDIA_CUDA_SDK/bin/linux/release/histogram64 --help

Using device 0: GeForce 8600 GT

Initializing data...

...allocating CPU memory.

...generating input data

...allocating GPU memory and copying input data

Running GPU histogram (1 iterations)...

histogram64GPU() time (average) : 49.438999 msec //1928.991954 MB/sec

Comparing the results...

...histogramCPU()

histogram64CPU() time : 141.229996 msec //675.263291 MB/sec

Total sum of histogram elements: 100000000

Sum of absolute differences: 0

TEST PASSED

Shutting down...

I seem to have only 4 multiprocessors and 32 cores, while your card reports 16 multiprocessors and 128 cores. Now, this is really, really strange to me. Can someone from NVidia clear this up?

What driver are you using? 16 MP/128 cores was a bug in the 169 or 173.xx drivers, I think, so odds are low that it’s working correctly with the CUDA 2.0 (or 2.1) examples.

I’m using the 177.82 driver. From what I understand from tmurray’s post, the numbers postgresql is getting are incorrect, as the 8600GT has only 4 MPs/32 cores, like what I’m getting.

postgresql, what driver are you using? Try installing the 177.82 driver and re-running the tests and your code.

Liad.

hello everybody,

i guess this was the golden advice.
i am using the 169 driver. this could well be the issue …
it even seems to work once in 10000 tries.
i will fix the driver.

many thanks,

hans

Hi Hans,

If you have a chance to look at my code for vector adding, would you mind helping me figure out the error? It produces exactly the same all-zero result for the GPU part.

Thank you so much!

Jason


#include <stdio.h>

#include <stdlib.h>

#include <cuda.h>

#include <cuda_runtime.h>

#include <cutil.h>

#include <windows.h>

// #define VAR 100000000

#define VAR 8

__global__ void vecAdd_d(float *A, float *B, float *C, int N) {

//int tid = blockDim.x * blockIdx.x + threadIdx.x;

int tid = threadIdx.x;

if( tid < N)

	C[tid] = A[tid] + B[tid];

}

void vecAdd_h(float *A, float *B, float *C, int N) {

for( int i = 0; i < N; i++){

	C[i] = A[i] + B[i]; 

// printf("%f\n", C[i]);

}

}

void cudasafe( cudaError_t error, const char* message) {

if(error!=cudaSuccess) { 

	fprintf(stderr,"ERROR: %s : %i\n",message,error); 

	exit(-1); 

}

}

int main() {

float *A_h;

float *B_h;

float *C_h;

float *A_d;

float *B_d;

float *C_d;



printf("**************GPU Processing****************\n\n");

printf("Device Initialization ...\n\n");

A_h = (float *) (malloc(sizeof(float) * VAR));

B_h = (float *) (malloc(sizeof(float) * VAR));

C_h = (float *) (malloc(sizeof(float) * VAR));

cudasafe(cudaMalloc( (void**)&A_d, sizeof(float) * VAR), "cudaMalloc");

cudasafe(cudaMalloc( (void**)&B_d, sizeof(float) * VAR), "cudaMalloc");

cudasafe(cudaMalloc( (void**)&C_d, sizeof(float) * VAR), "cudaMalloc");

printf("Memory Allocation is Done!\n");

for (int i = 0; i < VAR; i++){

	A_h[i] = 2.0;

	B_h[i] = 2.0;

}

cudasafe(cudaMemcpy(A_d, A_h, sizeof(float) * VAR, cudaMemcpyHostToDevice), "cudaMemcpy");

cudasafe(cudaMemcpy(B_d, B_h, sizeof(float) * VAR, cudaMemcpyHostToDevice), "cudaMemcpy");

printf("Data copied from Host to Device!\n");

int threadsPerBlock = 4;

int blocksPerGrid = /*(VAR + threadsPerBlock -1)/threadsPerBlock*/2;

dim3 dimBlock(threadsPerBlock, 1, 1);

dim3 dimGrid(blocksPerGrid, 1, 1);

printf("Thread configuration is Done!\n\n");



// Kernel invocation 

printf("Invoking Kernel functions...\n");

LARGE_INTEGER curFreq_d, curStart_d, curEnd_d;

QueryPerformanceFrequency(&curFreq_d);

QueryPerformanceCounter(&curStart_d);

//vecAdd_d<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, VAR);

vecAdd_d<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, VAR);

//vecAdd_d<<<dimGrid, 1, 1>>>(A_d, B_d, C_d, VAR);

cudasafe(cudaThreadSynchronize(), "cudaThreadSynchronize");

QueryPerformanceCounter(&curEnd_d);

cudasafe(cudaMemcpy(C_h, C_d, sizeof(float) * VAR, cudaMemcpyDeviceToHost), "cudaMemcpy");

printf("Data copied from Device to Host!\n\n");

printf("Device adding result:\n");

for (int i = 0; i < VAR; i++)

	printf("line: %f\n", C_h[i]);



printf("\n");

cudasafe(cudaFree(A_d), "cudaFree");

cudasafe(cudaFree(B_d), "cudaFree");

cudasafe(cudaFree(C_d), "cudaFree");

printf("Device memory space is Freed!\n\n");

double time_d = (double)(curEnd_d.QuadPart-curStart_d.QuadPart)/curFreq_d.QuadPart;

printf("Device Executing Time: %f(ms)\n", time_d * 1000);

printf("**************GPU Processing is Done****************\n\n");

printf("**************CPU Processing****************\n\n");

LARGE_INTEGER curFreq_h, curStart_h, curEnd_h;

QueryPerformanceFrequency(&curFreq_h);

QueryPerformanceCounter(&curStart_h);

vecAdd_h(A_h, B_h, C_h, VAR);

QueryPerformanceCounter(&curEnd_h);

printf("Host adding result:\n");

for (int i = 0; i < VAR; i++)

	printf("line: %f\n", C_h[i]);

double time_h = (double)(curEnd_h.QuadPart-curStart_h.QuadPart)/curFreq_h.QuadPart;

printf("Host Executing Time = %f(ms)\n",  time_h * 1000);

printf("**************CPU Processing is Done****************\n\n");

printf("Vector Size = %d\n", VAR);

printf("Speedup = %f\n", time_h/time_d);

free(A_h);

free(B_h);

free(C_h);

return 0;

}

I have a couple of pieces of general advice.

I don’t remember the details, but I recall that some error cases are missed by cudaThreadSynchronize() invoked after the kernel launch. I think if the kernel doesn’t launch at all, cudaThreadSynchronize() would not return an error. I would suggest calling cudaGetLastError() right after the <<<…>>> kernel launch and checking its error code, just to be sure.
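Something along these lines (just a sketch, reusing your existing cudasafe() helper) should catch a failed launch:

vecAdd_d<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, VAR);
cudasafe(cudaGetLastError(), "kernel launch");              // catches launch failures (bad configuration, etc.)
cudasafe(cudaThreadSynchronize(), "cudaThreadSynchronize"); // catches errors during kernel execution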

On a separate note, you might want to initialize C_h with some garbage numbers to verify that they really are overwritten when you pull the data back from the device. I would do it in the same loop where you initialize A_h and B_h.
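For example (again just a sketch), in your initialization loop:

for (int i = 0; i < VAR; i++) {
    A_h[i] = 2.0f;
    B_h[i] = 2.0f;
    C_h[i] = -999.0f;  // sentinel value; should be overwritten by the copy back from the device
}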

Good luck!

Hi,

Total newb myself. Not a real programmer and just starting to work through examples, etc.

Not sure if it helps, but your code compiles for me and seems to give some answers:

mark@k2 ~/code $ nvcc vectorAdd.cu -o vectorAdd

mark@k2 ~/code $ ./vectorAdd

RAM allocated …

calling kernel …

copy data …

addition done …

line: 4.000000

line: -0.000000

line: -0.000885

line: -6594920041903528927565384253440.000000

line: -0.000000

line: 0.000000

line: 477388702928806608896.000000

line: 4639240329575268352.000000

fixed …

mark@k2 ~/code $

mark@k2 ~/code/chapter03 $ ./enum_gpu

— General Information for device 0 —

Name: GeForce 9500 GT

Compute capability: 1.1

Clock rate: 1400000

Device copy overlap: Enabled

Kernel execution timeout : Enabled

— Memory Information for device 0 —

Total global mem: 1073020928

Total constant Mem: 65536

Max mem pitch: 2147483647

Texture Alignment: 256

— MP Information for device 0 —

Multiprocessor count: 4

Shared mem per mp: 16384

Registers per mp: 8192

Threads in warp: 32

Max threads per block: 512

Max thread dimensions: (512, 512, 64)

Max grid dimensions: (65535, 65535, 1)

mark@k2 ~/code/chapter03 $

Cheers