Hi, as the topic title says: I want to know how to allocate a matrix on the device. For example:
#include <stdio.h>
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
float **A, **B, **C;
int i,j;
A=(float**)malloc(10*sizeof(float*));
for(i=0; i<10; i++)
A[i] = (float*)malloc(10*sizeof(float));
B=(float**)malloc(10*sizeof(float*));
for(i=0; i<10; i++)
B[i] = (float*)malloc(10*sizeof(float));
// And how is C allocated on the device?
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
Thanks a lot!
pQB
November 21, 2011, 2:28pm
2
Hello,
The simplest way is to allocate the matrix in row-major order: http://en.wikipedia…Row-major_order
// size of the matrix
unsigned int sizem = N * N * sizeof(float);
// pointer to the matrix
float* A = NULL;
// allocate NxN element of type float
cudaMalloc( (void**) &A, sizem);
// The same for matrix B and C
Configure and launch your kernel.
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C, N);
The kernel must index the data in row-major order
// Kernel definition
__global__ void MatAdd(float* A, float* B, float* C, int N)
{
// map from threadIdx/BlockIdx to row-major order
int col = threadIdx.x + blockIdx.x * blockDim.x;
int row = threadIdx.y + blockIdx.y * blockDim.y;
// calculate the row-major index
int gid = row*N + col;
C[gid] = A[gid] + B[gid];
}
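Putting pQB's pieces together, a complete host-side program might look like the sketch below. Note that the snippets above leave out the cudaMemcpy calls that actually move the data between host and device; N, the fill values, and the variable names here are illustrative, and error checking is omitted for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 16

// Kernel indexing the flat arrays in row-major order
__global__ void MatAdd(float* A, float* B, float* C, int n)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

int main()
{
    unsigned int sizem = N * N * sizeof(float);

    // Host matrices, stored flat in row-major order
    float *hA = (float*)malloc(sizem);
    float *hB = (float*)malloc(sizem);
    float *hC = (float*)malloc(sizem);
    for (int i = 0; i < N * N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device matrices: one contiguous cudaMalloc per matrix
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizem);
    cudaMalloc((void**)&dB, sizem);
    cudaMalloc((void**)&dC, sizem);

    // Copy the inputs to the device
    cudaMemcpy(dA, hA, sizem, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizem, cudaMemcpyHostToDevice);

    // Configure and launch: one block of N x N threads
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(dA, dB, dC, N);

    // Copy the result back to the host
    cudaMemcpy(hC, dC, sizem, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```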
Isn’t there a way to use the matrix with a double-subscript structure? Example:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
No. You always have to access the matrix as shown before:
int i = threadIdx.x;
int j = threadIdx.y;
int offset = j + i*N;
C[offset] = A[offset] + B[offset];
In order to allocate on the device you just use the following:
float *A;
cudaMalloc((void**)&A, sizeof(float) * N * N);
So far this is the only way. The declaration “float **A;” has a different meaning in CUDA C than the usual 2D matrix: it is a pointer to an array of pointers, not one contiguous block, so it cannot be passed to a kernel directly.
pQB
November 21, 2011, 3:09pm
5
Just note that your access pattern is not coalesced. To improve performance, data in global memory must be accessed in a coalesced way: consecutive threads should read consecutive addresses.
int i = threadIdx.x;
int j = threadIdx.y;
int offset = i + j*N; // coalesced :)
C[offset] = A[offset] + B[offset];
Thanks, good point. I always mix up FORTRAN and C formats.