__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
Where are the parameter variables A, B, C, and N stored? Are they stored in global device memory, in per-block shared memory, or even in per-thread private (local) memory?
The parameters (memory addresses for pointer variables A, B, and C, and the integer N) are stored in shared memory on compute capability 1.x devices, and in constant memory on compute capability 2.x and greater. “Constant memory” is just device memory accessed through a special constant cache on the GPU.
The arrays that A, B, and C point to can reside in device memory or, on devices that support mapped memory, in page-locked (pinned) host memory.
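As a minimal sketch of the mapped (zero-copy) case, assuming a device that supports cudaHostAllocMapped (on older devices, cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any allocation):

```cuda
float *h_A, *d_Amapped;
// Allocate page-locked host memory that is mapped into the device address space
cudaHostAlloc((void **) &h_A, sizeof(float) * 1024, cudaHostAllocMapped);
// Obtain a device pointer that aliases the same memory as h_A
cudaHostGetDevicePointer((void **) &d_Amapped, h_A, 0);
// d_Amapped can now be passed to VecAdd in place of a cudaMalloc'd pointer
```

Kernel accesses through such a pointer travel over PCIe, so this is usually only a win for data that is read or written once.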
The missing piece here is the device memory allocation. Typically, one uses cudaMalloc() on the host; no global __device__ variable is needed.
Example host code:
const int n = 1024;
float *d_A, *d_B, *d_C;
// Allocate device memory
cudaMalloc((void **) &d_A, sizeof(float) * n);
cudaMalloc((void **) &d_B, sizeof(float) * n);
cudaMalloc((void **) &d_C, sizeof(float) * n);
// Loading data into the A and B arrays with cudaMemcpy() is not shown
// Launch kernel
const int threads_per_block = 64;
const int blocks = n / threads_per_block;  // exact: n is a multiple of 64
VecAdd<<<blocks, threads_per_block>>>(d_A, d_B, d_C, n);