How do you allocate a matrix on the device?

Hi, as the topic title says, I want to know how to allocate a matrix on the device. For example:

#include <stdio.h>
#include <stdlib.h>

#define N 10

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    float **A, **B, **C;
    int i;

    A = (float**)malloc(N*sizeof(float*));
    for (i = 0; i < N; i++)
        A[i] = (float*)malloc(N*sizeof(float));
    B = (float**)malloc(N*sizeof(float*));
    for (i = 0; i < N; i++)
        B[i] = (float*)malloc(N*sizeof(float));

    // And how do I allocate C on the device?

    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Thanks a lot!

Hello,

The simplest way is to allocate the matrix in row-major order: http://en.wikipedia…Row-major_order

// size of the matrix in bytes
unsigned int sizem = N * N * sizeof(float);

// pointer to the matrix
float* A = NULL;

// allocate N x N elements of type float
cudaMalloc( (void**) &A, sizem );

// The same for matrices B and C

Configure and launch your kernel:

int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C, N);

The kernel must index the data in row-major order:

// Kernel definition
__global__ void MatAdd(float* A, float* B, float* C, int N)
{
    // map from threadIdx/blockIdx to row-major order
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    // calculate the row-major index
    int gid = row*N + col;

    C[gid] = A[gid] + B[gid];
}
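Putting the pieces together, a minimal complete host program might look like the sketch below (assuming a small N so one block of N x N threads is enough; error checking omitted for brevity; the kernel parameter is named N_ only to avoid clashing with the macro):

```cuda
#include <stdio.h>

#define N 16

// Kernel: element-wise addition over a row-major N x N matrix
__global__ void MatAdd(float* A, float* B, float* C, int N_)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row < N_ && col < N_)
        C[row * N_ + col] = A[row * N_ + col] + B[row * N_ + col];
}

int main()
{
    unsigned int sizem = N * N * sizeof(float);

    // host matrices, stored flat in row-major order
    float hA[N * N], hB[N * N], hC[N * N];
    for (int i = 0; i < N * N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // device matrices
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizem);
    cudaMalloc((void**)&dB, sizem);
    cudaMalloc((void**)&dC, sizem);

    // copy the inputs to the device
    cudaMemcpy(dA, hA, sizem, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizem, cudaMemcpyHostToDevice);

    // one block of N x N threads, as in the original example
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(dA, dB, dC, N);

    // copy the result back
    cudaMemcpy(hC, dC, sizem, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);  // should print 3.0 on a working device

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Note that only flat device pointers are passed to the kernel; the host never builds a `float**` structure at all.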

Isn’t there a way to use the matrix with double-subscript (A[i][j]) indexing? Example:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

No. You will always have to access the matrix as mentioned before:

int i = threadIdx.x;
int j = threadIdx.y;
int offset = j + i*N;
C[offset] = A[offset] + B[offset];

In order to allocate on the device you just use the following:

float *A;
cudaMalloc((void**)&A, sizeof(float)*N*N);

So far this is the only way. The line “float **A;” has a different meaning in CUDA C than the usual 2D matrix: it is a pointer to an array of pointers, not a contiguous block of N*N floats.

Just note that your access pattern is not coalesced. To improve performance, data in global memory must be accessed in a coalesced way:

int i = threadIdx.x;
int j = threadIdx.y;
int offset = i + j*N; // coalesced :)
C[offset] = A[offset] + B[offset];

Thanks. Good point. I always mix up the FORTRAN and C formats.