CUDA speed testing help

First of all, I'm very new to CUDA and very familiar with C/C++. My main goal is to use CUDA to implement real-time stereo matching of images by running many threads at once. However, since I'm new to CUDA, I thought it would be best to start small, show myself how CUDA is used, and run a few speed tests. I'm having trouble understanding this code sample from Dr. Dobb's Journal.

// incrementArray.cu

#include <stdio.h>
#include <windows.h>
#include <assert.h>
#include <cuda.h>
#include "stopwatch.hpp"

//handled by the gpu
__global__ void incrementArrayOnDevice(double *a, double N)
{
// each thread computes the one array index it is responsible for
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) a[idx] += 1;
}

int main(void)
{
Stopwatch timer;
double i = 0;
double numDoubles = 256 * 8388608;
double* cuda_data;
double* result_data;

result_data = (double*)malloc(sizeof(double)*numDoubles);
cudaMalloc((void**)&cuda_data, sizeof(double)*numDoubles);

// do calculation using cuda:
// Part 1 of 2. Compute execution configuration
double numThreadsPerBlock = 256;
double numBlocks = numDoubles / numThreadsPerBlock;

// Part 2 of 2. Call incrementArrayOnDevice kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
timer.start();
incrementArrayOnDevice <<< dimGrid, dimBlock >>> (cuda_data, numDoubles);

// Retrieve result from device and store in b_h
cudaMemcpy(result_data, cuda_data, sizeof(double)*numDoubles, cudaMemcpyDeviceToHost);
timer.stop();
printf("Time to calculate using cuda: %i\n", timer.getTime());
timer.reset();

// cleanup
free(result_data); // allocated with malloc, so release with free
cudaFree(cuda_data);

system("pause");

}

Can somebody please explain to me what the kernel is actually doing once it's called? How many threads are running at the same time once it executes?
Also, if I ran through a for loop numDoubles times on the host, would it be much slower than the kernel call to CUDA?

I need to somehow show myself that CUDA is performing much faster than the CPU would, but I'm doing non-trivial work, so I can't tell much of a difference in what's being done.
I hope I explained my situation well, and I hope someone can help me.

Thanks very much in advance

I highly suggest that you read the CUDA programming guide. Start at the beginning and read the first few chapters straight through; they will explain a lot.

To answer your specific question about the number of threads: the kernel launches numBlocks blocks of numThreadsPerBlock threads each, so numBlocks * numThreadsPerBlock threads are run in total. Each thread computes its own index from blockIdx, blockDim, and threadIdx, then adds 1 to that single array element, as you can see in incrementArrayOnDevice(). (Note that you will have serious issues running that kernel unless you run on a GT200 and compile with -arch sm_13 to enable support for doubles.)

Thank you for the reply. I have an 8800 GTX and I ran the kernel fine.
Where can I find the CUDA programming guide?

Thanks again

Odd, most people report problems even reading doubles on pre-compute-1.3 hardware.

Anyway, the programming guide is installed in the doc/ directory of the CUDA toolkit. It is also available under the “Developing with cuda”->Documentation link at NVIDIA’s web site: http://www.nvidia.com/object/cuda_develop.html

great, thanks again!