Storage location of kernel parameters


If I have a kernel function such as

__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

where are the parameter variables A, B, C and N stored? Are they
stored in global device memory, in per-block shared memory, or even
in thread-private memory?

The parameters (memory addresses for pointer variables A, B, and C, and the integer N) are stored in shared memory on compute capability 1.x devices, and in constant memory on compute capability 2.x and greater. “Constant memory” is just device memory accessed through a special constant cache on the GPU.

The arrays that A, B, and C point to can reside in device memory or (on newer devices that support this) in page-locked system memory on the host.
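As an illustration of the page-locked case, here is a minimal sketch that runs the same VecAdd kernel on arrays living in mapped, page-locked host memory. It assumes device 0 is the GPU in question and that it reports canMapHostMemory; the test values and the mismatch check are my own additions for demonstration, not part of the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 1024;
    const size_t bytes = sizeof(float) * n;

    // Assumption: device 0 is the GPU of interest.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory) {
        printf("Device 0 cannot map page-locked host memory\n");
        return 1;
    }

    // Allow page-locked host allocations to be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate page-locked host memory that the GPU can access directly.
    float *h_A = nullptr, *h_B = nullptr, *h_C = nullptr;
    if (cudaHostAlloc((void**)&h_A, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void**)&h_B, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void**)&h_C, bytes, cudaHostAllocMapped) != cudaSuccess) {
        printf("cudaHostAlloc failed\n");
        return 1;
    }

    // Arbitrary test data.
    for (int i = 0; i < n; ++i) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = 2.0f * i;
    }

    // Obtain the device-side addresses of the host buffers. These addresses
    // are what gets passed as kernel parameters.
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaHostGetDevicePointer((void**)&d_A, h_A, 0);
    cudaHostGetDevicePointer((void**)&d_B, h_B, 0);
    cudaHostGetDevicePointer((void**)&d_C, h_C, 0);

    const int threads_per_block = 64;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    VecAdd<<<blocks, threads_per_block>>>(d_A, d_B, d_C, n);

    // The kernel writes straight into host memory; wait before reading it.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("Kernel execution failed\n");
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_C[i] != h_A[i] + h_B[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);
    return mismatches == 0 ? 0 : 1;
}
```

Note that no cudaMemcpy() is needed here, because the kernel reads and writes the host buffers directly. The kernel parameters themselves are still just addresses and an integer, stored as described above.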

So if I want a global variable to be passed in the kernel call, then I create a static array in device memory with

__device__ float *A_dev;

and then just do

A = A_dev

Is that right?

The missing bit here is the device memory allocation. Typically, one uses cudaMalloc() on the host without a global device variable.
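If a global `__device__` pointer is wanted anyway, one possible way to wire it up is sketched below. The pointer variable A_dev only holds an address, so the array still has to be allocated with cudaMalloc() and the resulting address copied into the variable with cudaMemcpyToSymbol(). The Scale kernel and the test values are hypothetical, chosen only to make the sketch self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global device-side pointer variable. It stores an address, not the array.
__device__ float* A_dev;

// Hypothetical kernel that reaches the array through the global pointer
// instead of through a kernel parameter.
__global__ void Scale(float factor, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        A_dev[i] *= factor;
}

int main()
{
    const int n = 1024;
    const size_t bytes = sizeof(float) * n;

    // Host data: all ones (arbitrary test values).
    float h_A[n];
    for (int i = 0; i < n; ++i)
        h_A[i] = 1.0f;

    // Allocate the actual array in device memory and fill it.
    float* d_A = nullptr;
    if (cudaMalloc((void**)&d_A, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    // Copy the pointer value itself (sizeof(float*) bytes) into A_dev.
    cudaMemcpyToSymbol(A_dev, &d_A, sizeof(d_A));

    const int threads_per_block = 64;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    Scale<<<blocks, threads_per_block>>>(2.0f, n);

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        printf("Kernel or copy failed\n");
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_A[i] != 2.0f)
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_A);
    return mismatches == 0 ? 0 : 1;
}
```

Passing the pointer as an ordinary kernel argument is usually simpler, since it avoids the extra symbol copy and keeps the kernel independent of global state.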

Example host code:

const int n = 1024;
float *d_A, *d_B, *d_C;

// Allocate device memory
cudaMalloc((void **) &d_A, sizeof(float) * n);
cudaMalloc((void **) &d_B, sizeof(float) * n);
cudaMalloc((void **) &d_C, sizeof(float) * n);

// Load data into the A and B arrays with cudaMemcpy() (not shown)

// Launch kernel
const int threads_per_block = 64;
const int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up so every element is covered
VecAdd<<<blocks, threads_per_block>>>(d_A, d_B, d_C, n);
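For reference, a complete program built around the host code above might look like the following sketch. It fills in the cudaMemcpy() calls that were left out, plus a launch error check and cleanup. The input values, the result check, and the error messages are assumptions added for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 1024;
    const size_t bytes = sizeof(float) * n;

    // Host arrays with arbitrary test data.
    float h_A[n], h_B[n], h_C[n];
    for (int i = 0; i < n; ++i) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = 2.0f * i;
    }

    // Allocate device memory.
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    if (cudaMalloc((void**)&d_A, bytes) != cudaSuccess ||
        cudaMalloc((void**)&d_B, bytes) != cudaSuccess ||
        cudaMalloc((void**)&d_C, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // Load data into the A and B arrays on the device.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch kernel. The three pointers and n are the kernel parameters.
    const int threads_per_block = 64;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    VecAdd<<<blocks, threads_per_block>>>(d_A, d_B, d_C, n);

    // Report launch-time errors such as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the result back. This blocking copy also waits for the kernel.
    err = cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("Copy back failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_C[i] != h_A[i] + h_B[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return mismatches == 0 ? 0 : 1;
}
```

The block count is rounded up so that sizes of n that are not a multiple of threads_per_block are still fully covered; the `if (i < N)` guard in the kernel keeps the surplus threads from writing out of bounds.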