CUDA Programming Guide Sample Errors

Hi All,

On page 18 there is a vector addition sample; the code is below. I would have expected the code to work as-is. Then I noticed a bunch of variables were missing. Once I fixed those, the code still did not work. Then I realized that the logic in the kernel was wrong. The threads are split into blocks of 256 threads each, with the number of blocks depending on N, so the index computed in the kernel actually should be:

i = threadIdx.x + blockDim.x * blockIdx.x;

Where do we post corrections to the CUDA Programming Guide?

Ramesh

// Device code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Host code
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);
    // Copy vectors from host memory to device memory
    // h_A and h_B are input vectors stored in host memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
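For reference, here is a corrected, self-contained sketch of what I ended up with. The vector length N, the host allocations (h_A, h_B, h_C), and passing N to the kernel are my own additions, so treat this as one possible fix rather than what the guide intends:

#include <stdlib.h>

// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    // Global index: thread offset within the block plus block offset within the grid
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = 1 << 20;                        // arbitrary vector length (my choice)
    size_t size = N * sizeof(float);

    // Allocate and fill input vectors in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate vectors in device memory
    float* d_A; cudaMalloc((void**)&d_A, size);
    float* d_B; cudaMalloc((void**)&d_B, size);
    float* d_C; cudaMalloc((void**)&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel: enough blocks of 256 threads to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory back to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}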

What milkyway1 said is right: there is an error in the CUDA Programming Guide. But I think the point of this section is to show how to use the memory management functions.

I also found errors in the Programming Guide. Sometimes it is annoying, at least when you are writing your first couple of CUDA programs.
It's good that people can point them out, and that they will get fixed with everybody's help (at least I hope so).

Yes, please continue reporting any bugs you find and we will try to fix the documentation ASAP.

Will do so as I go through the guide.

The example on dynamic shared memory allocation in Section B.2.3 is not wrong as it stands, but it is missing a discussion of alignment. It should make clear that a slight modification of the example, such as this one, doesn't work:

extern __shared__ char array[];

__device__ void func()    // __device__ or __global__ function
{
    short* array0 = (short*)array;
    // array0[127] sits at byte offset 254, which is not a multiple of
    // sizeof(float) = 4, so array1 is a misaligned float pointer.
    float* array1 = (float*)&array0[127];
    // array1[64] then sits at byte offset 510, also misaligned for int.
    int*   array2 = (int*)&array1[64];
}
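One way to keep the casts legal (just a sketch; the array sizes of 64 elements each are arbitrary) is to lay the arrays out from the most strictly aligned type to the least, so every byte offset stays a multiple of the alignment of the type being cast to:

extern __shared__ char array[];

__device__ void func()
{
    // float and int require 4-byte alignment, short only 2-byte alignment,
    // so placing them in that order keeps every offset properly aligned.
    float* array0 = (float*)array;          // byte offset 0
    int*   array1 = (int*)&array0[64];      // byte offset 256, multiple of 4
    short* array2 = (short*)&array1[64];    // byte offset 512, multiple of 2
}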