__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
Where are the parameter variables A, B, C, and N stored? Are they stored in global device memory, in per-block shared memory, or even in per-thread private (local) memory?
The parameters (memory addresses for pointer variables A, B, and C, and the integer N) are stored in shared memory on compute capability 1.x devices, and in constant memory on compute capability 2.x and greater. “Constant memory” is just device memory accessed through a special constant cache on the GPU.
The arrays that A, B, and C point to can reside in device memory or, on devices that support mapped memory, in page-locked (pinned) host memory.
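As a minimal sketch of the mapped (zero-copy) case, assuming a device that supports cudaHostAllocMapped (on older devices, cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any allocation):

```cuda
float *h_A, *d_Amapped;
// Allocate page-locked host memory that is mapped into the device address space
cudaHostAlloc((void **) &h_A, sizeof(float) * 1024, cudaHostAllocMapped);
// Obtain a device pointer that aliases the same memory as h_A
cudaHostGetDevicePointer((void **) &d_Amapped, h_A, 0);
// d_Amapped can now be passed to VecAdd in place of a cudaMalloc'd pointer
```

Kernel accesses through such a pointer travel over PCIe, so this is usually only a win for data that is read or written once.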
The missing piece here is the device memory allocation. Typically, one uses cudaMalloc() on the host; no global __device__ variable is needed.
Example host code:
const int n = 1024;
float *d_A, *d_B, *d_C;
// Allocate device memory
cudaMalloc((void **) &d_A, sizeof(float) * n);
cudaMalloc((void **) &d_B, sizeof(float) * n);
cudaMalloc((void **) &d_C, sizeof(float) * n);
// Loading data into the A and B arrays with cudaMemcpy() is not shown
// Launch kernel
const int threads_per_block = 64;
const int blocks = n / threads_per_block;  // exact: n is a multiple of 64
VecAdd<<<blocks, threads_per_block>>>(d_A, d_B, d_C, n);