First of all, I'm very new to CUDA but very familiar with C/C++. My main goal is to use CUDA to implement real-time stereo matching of images by running many threads at once. However, since I'm new to CUDA, I thought it would be best to start small, show myself how CUDA is used, and run a few speed tests. I'm having trouble understanding this code sample from Dr. Dobb's Journal.
    // handled by the gpu
    __global__ void incrementArrayOnDevice(double *a, size_t N)
    {
        size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] += 1;
    }

    double i = 0;                              // (unused in this snippet)
    size_t numDoubles = 256ULL * 8388608ULL;   // 2^31 elements -- about 16 GB of doubles
    double *result_data;
    double *cuda_data;

    result_data = (double*)malloc(sizeof(double) * numDoubles);
    cudaMalloc((void**)&cuda_data, sizeof(double) * numDoubles);

    // do calculation using cuda:
    // Part 1 of 2. Compute execution configuration
    int numThreadsPerBlock = 256;
    int numBlocks = numDoubles / numThreadsPerBlock;  // 8388608 blocks

    // Part 2 of 2. Call incrementArrayOnDevice kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);

    timer.start();
    incrementArrayOnDevice<<<dimGrid, dimBlock>>>(cuda_data, numDoubles);

    // Retrieve result from device and store in result_data
    cudaMemcpy(result_data, cuda_data, sizeof(double) * numDoubles, cudaMemcpyDeviceToHost);
    timer.stop();

    printf("Time to calculate using cuda: %i\n", timer.getTime());
    timer.reset();

    // cleanup
    free(result_data);  // malloc'd memory must be released with free(), not delete
    cudaFree(cuda_data);
    system("pause");
Can somebody please explain what the kernel is actually doing once it's called? How many threads are run at the same time when it executes?
Also, if I ran through a for loop numDoubles times on the host, would it be much slower than the kernel call?
I need to somehow show myself that CUDA performs much faster than the CPU would, but since the work being done here is so trivial, I can't tell much of a difference.
I hope I explained my situation well, and that someone can help me.
Thanks very much in advance!