I’m very new to CUDA. I just wrote a program that triples the values of a given matrix.
The program compiles and runs fine.
I would like to know if there is a better way to do it. Your feedback will help me a lot.
Here is the entire program:
//This is a sample program which triples the values of a given matrix
#include <stdio.h>
#include <cuda.h>

#define BLOCK_SIZE 16 // tile width; must match the block dimensions at launch

__global__ void triple_kernel(float *a, int M, int N)
{
    //set up block index
    int bIdy = blockIdx.y;
    //set up thread indices
    int tIdx = threadIdx.x;
    int tIdy = threadIdx.y;
    //triple values for each block
    //method 1: this uses shared memory, based on the sample code provided
    //with matrix multiplication
    //first submatrix to be computed by this block
    int aBegin = M * BLOCK_SIZE * bIdy;
    //last submatrix to be computed
    int aEnd = aBegin + N - 1;
    //step to go to the next submatrix
    int step = BLOCK_SIZE;
    //start computation on all of A's submatrices
    for (int i = aBegin; i <= aEnd; i += step)
    {
        //declare the submatrix in shared memory
        __shared__ float AS[BLOCK_SIZE][BLOCK_SIZE];
        //initialize the submatrix of A in shared memory
        AS[tIdx][tIdy] = a[i + M * tIdy + tIdx];
        //synchronize
        __syncthreads();
        //start the operation on the submatrix
        AS[tIdx][tIdy] = AS[tIdx][tIdy] * AS[tIdx][tIdy] * AS[tIdx][tIdy];
        //synchronize
        __syncthreads();
        //update the original matrix
        a[i + M * tIdy + tIdx] = AS[tIdx][tIdy];
    }
}
First of all, why is this in the Linux support forum instead of the General CUDA programming forum?
Second, with regard to your query: if I understood correctly, what you’re trying to do is simply raise each cell of a matrix to the third power? If so, why use shared memory at all?
__global__ void matPow3(float *mat, unsigned int szX, unsigned int szY)
{
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    // Check that this thread falls within the matrix's boundaries.
    if (tx < szX && ty < szY)
    {
        // The compiler should be able to implicitly use a register if you just do:
        // mat[ty * szX + tx] = mat[ty * szX + tx] * mat[ty * szX + tx] * mat[ty * szX + tx];
        // but we'll write it explicitly for clarity.
        float val = mat[ty * szX + tx];
        mat[ty * szX + tx] = val * val * val;
    }
}
.
.
.
// Kernel invocation
dim3 blockSz(16, 16); // This is usually a good starting point for the block size.
unsigned int gx = szX / blockSz.x + (szX % blockSz.x > 0 ? 1 : 0); // divide the matrix size by the block size, and add 1 if there's a remainder.
unsigned int gy = szY / blockSz.y + (szY % blockSz.y > 0 ? 1 : 0); // divide the matrix size by the block size, and add 1 if there's a remainder.
dim3 gridSz(gx, gy);
matPow3<<< gridSz, blockSz >>>(mat, szX, szY);
Thank you for your answer.
I’m using CUDA on Fedora 10, which is why I posted it here.
I used shared memory because I thought it would be
faster than loading directly from global memory.
But thinking about it carefully, I think you are totally right, since
there is no reuse of the data.
Thank you very much. I’m on my way, and I hope to help other newbies
as soon as possible.