Simple question regarding mem allocation/copy

Hello. I’m working on a program that applies a blur algorithm to a matrix. At first I used a 1D array with the matrix stored in row-major order, which worked fine, but now I want to modify it to use 2D arrays.

float **h_iA, **d_iA;

h_iA = (float**)malloc(sizeof(float*)*N);
for (int m = 0; m < N; m++)
	h_iA[m] = (float*)malloc(sizeof(float)*N);

InitMat(h_iA, N);

cudaMalloc((void**) &d_iA, sizeof(float*)*N);
for (int u = 0; u < N; u++)
	cudaMalloc((void **) &d_iA[u], sizeof(float)*N);

for (int i = 0; i < N; i++)
	cudaMemcpy(d_iA, h_iA, sizeof(float*)*N, cudaMemcpyHostToDevice);

for (int i = 0; i < N; i++)
	cudaMemcpy(d_iA[i], h_iA[i], sizeof(float)*N, cudaMemcpyHostToDevice);

Is this the proper way to allocate the device memory for d_iA and copy the data from h_iA to it?


cudaMemcpy(d_iA[i], h_iA[i], sizeof(float)*N , cudaMemcpyHostToDevice)

is invalid because d_iA holds a device address: evaluating d_iA[i] in host code dereferences device memory, which the host cannot access.

(Passing d_iA itself is fine, since the variable d_iA lives in host memory even though its value is a device address.)
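
For readers who still want the pointer-table layout, here is a minimal sketch of one way to build it without touching device memory from the host. The helper name uploadRows and the extra host array h_rowPtrs are hypothetical, not from the original post, and error checking is left out. The idea is to keep the device row addresses in a host-side array first, then copy that array into the device-side pointer table. Note that the original loop calling cudaMalloc on &d_iA[u] runs into the same restriction as the quoted line (cudaMalloc would have to write its result to a device address from host code), and that copying h_iA into d_iA would put host addresses into the device table.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical helper (not from the original post): uploads an n x n matrix
// stored as an array of host row pointers. h_rowPtrs is a caller-provided
// HOST array with n entries; it receives the device address of each row so
// the rows can be freed later. Error checking is omitted for brevity.
float **uploadRows(float **h_iA, int n, float **h_rowPtrs)
{
    // Step 1: allocate every row on the device. The returned device
    // addresses are written into host memory, which the host may access.
    for (int i = 0; i < n; i++) {
        cudaMalloc((void **)&h_rowPtrs[i], sizeof(float) * n);
        cudaMemcpy(h_rowPtrs[i], h_iA[i], sizeof(float) * n,
                   cudaMemcpyHostToDevice);
    }

    // Step 2: allocate the pointer table on the device and fill it with
    // the DEVICE row addresses collected above (not the host pointers).
    float **d_iA = NULL;
    cudaMalloc((void **)&d_iA, sizeof(float *) * n);
    cudaMemcpy(d_iA, h_rowPtrs, sizeof(float *) * n,
               cudaMemcpyHostToDevice);
    return d_iA;
}
```

A kernel can then read d_iA[i][j]. To clean up, call cudaFree on each h_rowPtrs[i] and then on d_iA.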

Why not use a 1-D array with 2-D row-major indexing?
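
As a rough illustration of that suggestion, the sketch below uploads an n x n row-major matrix with a single allocation and a single copy. The helper name uploadFlat is hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical helper (not from the original post): copies an n x n matrix,
// stored contiguously in row-major order on the host, into one contiguous
// device buffer. Error checking is omitted for brevity.
float *uploadFlat(const float *h_A, int n)
{
    float *d_A = NULL;
    const size_t bytes = sizeof(float) * (size_t)n * (size_t)n;

    cudaMalloc((void **)&d_A, bytes);                     // one allocation
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // one copy
    return d_A;
}
```

Element (row, col) is then at d_A[row * n + col], and the whole matrix is released with one cudaFree(d_A).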

So, everything else is correct and that line isn’t needed?

As for the 1-D array, the offset multiplications tend to get a bit expensive with a large amount of data.

That is not correct. If you implement the 2-D array A as an array of pointers, then to fetch A[i][j] you must first fetch the pointer A[i] from device memory, and that extra memory access costs about 500 cycles. With a 2-D index into a 1-D array, you only pay for a few MAD operations in the index computation, for example:

const int inx = threadIdx.x;
const int iny = threadIdx.y;
const int ibx = blockIdx.x * BLOCK_SIZE_X;
const int iby = blockIdx.y * BLOCK_SIZE_Y;

// gid is the 1-D index into the row-major array; wA is the row width.
// ibx and iby are already scaled by the block size, so they are not scaled again.
int gid = (ibx + inx) + (iby + iny) * wA;

If you can keep more than 192 threads active on one SM, then each MAD costs only about 1/8 cycle per thread on average.
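
To show the row-major index in context, here is a minimal sketch of a kernel built around it. The 3x3 box blur, the kernel name boxBlur3x3, the height parameter hA, and the 16x16 block size are all assumptions chosen for illustration; the original post does not say which blur is used.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE_X 16  // assumed block width, not from the original post
#define BLOCK_SIZE_Y 16  // assumed block height, not from the original post

// Hypothetical kernel: 3x3 box blur over a row-major wA x hA matrix stored
// in a flat 1-D array. Neighbours outside the matrix are skipped, so border
// elements are averaged over fewer samples.
__global__ void boxBlur3x3(const float *in, float *out, int wA, int hA)
{
    const int col = blockIdx.x * BLOCK_SIZE_X + threadIdx.x;
    const int row = blockIdx.y * BLOCK_SIZE_Y + threadIdx.y;
    if (col >= wA || row >= hA)
        return;  // threads past the matrix edge do nothing

    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            const int r = row + dy;
            const int c = col + dx;
            if (r >= 0 && r < hA && c >= 0 && c < wA) {
                sum += in[c + r * wA];  // same row-major index as gid
                count++;
            }
        }
    }
    out[col + row * wA] = sum / (float)count;
}
```

A matching launch would use dim3 block(BLOCK_SIZE_X, BLOCK_SIZE_Y) and dim3 grid((wA + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X, (hA + BLOCK_SIZE_Y - 1) / BLOCK_SIZE_Y), so that matrices whose size is not a multiple of the block size are still fully covered.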