Using double in a kernel: Rob Farber's Dr. Dobb's code with type double

I’m trying to use the code provided by Rob Farber in the Dr. Dobb’s May 13, 2008 article “CUDA, Supercomputing for the Masses: Part 3”. The code works well with types int and float, but with type double for the host/device memory it does not return the same numbers. I’m using XP 64, MS Visual Studio 2005, CUDA 2.1; my video card is a 9800 GX2.

[codebox]/* test Dr.Dobb's code
 * - use FLOAT or DOUBLE array
 * - submit to nvidia cuda forum
 */

// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>

// includes, project
#include <cutil_inline.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(double *d_out, double *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

int main( int argc, char** argv)
{
    // pointer for host memory and size
    double *h_a;

    // pointers for device memory
    double *d_b, *d_a;

    // define number of elements, grid, and block size
    int dimA = 256 * 1024; // 256K elements (2MB total for double)
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on
    // array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(double);
    h_a = (double *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; i++)
    {
        h_a[i] = rand() / (double)RAND_MAX;
        if (i == 0 || i == dimA-1)
            printf("h_a[%d] %4.4f \n", i, h_a[i]);
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check for any errors from the kernel launch
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        //assert(h_a[i] == dimA - 1 - i );
        if (i == 0 || i == dimA-1)
            printf("h_a[%d] %4.4f \n", i, h_a[i]);
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are
    // correct and there are no run-time errors. Good work!
    printf("Correct!\n");

    cutilExit(argc, argv);
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err ) );
        exit(EXIT_FAILURE);
    }
}[/codebox]

I need to use double precision variables for a research project at school (physics department).

Thank you.

Only Compute 1.3-capable cards support double precision (so the GTX 260, 280 and Tesla C1060). The 9800 GX2 supports Compute 1.1, so it won’t work with doubles.
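For anyone who wants to verify this on their own card, a minimal sketch of a compute-capability check with the CUDA runtime API (`cudaGetDeviceProperties`) might look like the following. Note also that even on a 1.3 card, the CUDA 2.x toolchain demotes doubles to floats unless you compile with `nvcc -arch=sm_13`.

```cuda
// Sketch: query device 0 and report whether it supports double precision
// (compute capability 1.3 or higher).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
        printf("double precision supported\n");
    else
        printf("double precision NOT supported on this device\n");

    return 0;
}
```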

Thank you for your prompt response.

Dear sir,

This is buj… I am very interested in learning CUDA. I know how to allocate memory and copy data from host to device for a 1-dimensional array, but I am confused about 2-dimensional arrays: how to declare the variables for host and device, how to allocate the memory, and how to copy the data from host to device. Please can you give me some idea for an array A[1024][1024], and tell me how to access threads for A[1024][1024] all at once?

Please help me kindly.
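Not from the thread, but a common way to handle a 1024x1024 array is `cudaMallocPitch`/`cudaMemcpy2D` plus a 2D grid of threads, one thread per element. A rough sketch (using float; the kernel name `scaleKernel` is just for illustration):

```cuda
// Sketch: allocate a pitched 1024x1024 array on the device, copy the host
// array over, run a 2D grid where each thread touches one element, and
// copy the result back.
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

__global__ void scaleKernel(float *d_a, size_t pitch)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        // pitch is in BYTES, so index rows through a char* cast
        float *rowPtr = (float *)((char *)d_a + row * pitch);
        rowPtr[col] *= 2.0f;
    }
}

int main(void)
{
    static float h_a[N][N];   // host array, A[1024][1024]
    float *d_a;
    size_t pitch;

    // device allocation: each row is padded to a pitch chosen by the driver
    cudaMallocPitch((void **)&d_a, &pitch, N * sizeof(float), N);

    // host -> device copy of the whole 2D array
    cudaMemcpy2D(d_a, pitch, h_a, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyHostToDevice);

    // 16x16 threads per block, 64x64 blocks covers all 1024x1024 elements
    dim3 block(16, 16);
    dim3 grid(N / block.x, N / block.y);
    scaleKernel<<<grid, block>>>(d_a, pitch);

    // device -> host copy back
    cudaMemcpy2D(h_a, N * sizeof(float), d_a, pitch,
                 N * sizeof(float), N, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    return 0;
}
```

You can also treat the 2D array as a flat 1024*1024 buffer with plain `cudaMalloc` and compute `row * N + col` yourself; the pitched version just gives better-aligned rows.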