Example of Matrix Multiplication (from the CUDA book): points that I don't understand...


I need help; this is from the CUDA book:

[CODE]
// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
//   hA is the height of A
//   wA is the width of A
//   wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

// Device multiplication function called by Mul()
// Compute C = A * B
//   wA is the width of A
//   wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
[/CODE]
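For reference, here is a minimal sketch of how Mul() could be driven from a host program. The main() function, the matrix dimensions, and the fill values below are my own illustrative assumptions, not part of the guide's listing; it assumes the code above is in the same file and that all dimensions are multiples of BLOCK_SIZE:

[CODE]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Hypothetical dimensions, all multiples of BLOCK_SIZE (16)
    const int hA = 32, wA = 48, wB = 64;

    float* A = (float*)malloc(hA * wA * sizeof(float));
    float* B = (float*)malloc(wA * wB * sizeof(float));
    float* C = (float*)malloc(hA * wB * sizeof(float));

    // Fill A and B with arbitrary test values
    for (int i = 0; i < hA * wA; ++i) A[i] = 1.0f;
    for (int i = 0; i < wA * wB; ++i) B[i] = 2.0f;

    Mul(A, B, hA, wA, wB, C);

    // Every element of C should equal wA * 1.0f * 2.0f = 96
    printf("C[0] = %f\n", C[0]);

    free(A); free(B); free(C);
    return 0;
}
[/CODE]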

I have a few questions.

1. In the function Muld, which runs on the device, we declare two shared-memory arrays:

[CODE]
// Shared memory for the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

// Shared memory for the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
[/CODE]

These declarations appear at the start of the loop body and go out of scope at its end.

a) Since the declarations are inside the loop, does this memory start at the same place (address) in shared memory for all the threads of the block, and is it released at the end of each iteration and allocated again when the loop restarts? In other words, does this memory begin and end within the same iteration, for all the threads?

b) From my understanding, and going by what the book says, the two arrays have the same address. Is that right?

Thanks, all!

I'm replying to all the posts.

Shared memory is the same memory for all threads within one block.

The arrays do not share the same memory; that is only true for extern shared memory. When using extern shared memory, you have to make sure that you split the single block of memory into the right number of parts you need. Its size is determined by the third parameter of the kernel call, like this:

[CODE]
myKernel<<<blocks, threads, SHARED_MEMORY_SIZE>>>(params);
[/CODE]
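To make the splitting concrete, here is a minimal sketch of partitioning one dynamic shared-memory block between two arrays. The kernel name, the sizes, and the element counts are my own assumptions for illustration:

[CODE]
#include <cuda_runtime.h>

// Hypothetical kernel: carves one extern shared block into two float arrays
__global__ void partitionKernel(int nFirst, int nSecond)
{
    extern __shared__ float smem[];     // one block of dynamic shared memory

    float* firstPart  = smem;           // first nFirst floats
    float* secondPart = smem + nFirst;  // next nSecond floats

    int t = threadIdx.x;
    if (t < nFirst)  firstPart[t]  = 1.0f;
    if (t < nSecond) secondPart[t] = 2.0f;
}

int main()
{
    // The third launch parameter must cover both parts
    partitionKernel<<<1, 256, (128 + 128) * sizeof(float)>>>(128, 128);
    cudaDeviceSynchronize();
    return 0;
}
[/CODE]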

So the maximum that I am using needs to satisfy:

the extern shared memory that I am declaring in the kernel
+ the shared memory that I am declaring in the kernel
= SHARED_MEMORY_SIZE from

[CODE]
myKernel<<<blocks, threads, SHARED_MEMORY_SIZE>>>(params);
[/CODE]

Am I on the right track?

As far as I know, no. You can have

[CODE]
__shared__ float first[256];
extern __shared__ float second[];
float* third;

third = &second[128];
[/CODE]

and call your kernel like

[CODE]
mykernel<<<grid, threads, 512 * sizeof(float)>>>(...);
[/CODE]

and you will have a first of size 256 and a second of size 512 (but you should only use the first 128 elements, otherwise you overwrite the contents of third), and your third will then be 384 elements big.

You have to check in your host code that 256 * sizeof(float), plus the dynamic shared memory you allocate from the host with the kernel call, is not too large for your grid and block dimensions. You can use the occupancy calculator beforehand to see what your limits are.
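Here is a sketch of that host-side check; the device query is the standard cudaGetDeviceProperties() runtime call, and the byte counts are the ones from the example above:

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Static shared memory declared in the kernel, plus the dynamic
    // amount passed at launch, must fit within the per-block limit
    size_t staticBytes  = 256 * sizeof(float);
    size_t dynamicBytes = 512 * sizeof(float);

    if (staticBytes + dynamicBytes > prop.sharedMemPerBlock)
        printf("Too much shared memory for this device!\n");
    else
        printf("OK: %zu of %zu bytes used.\n",
               staticBytes + dynamicBytes, (size_t)prop.sharedMemPerBlock);
    return 0;
}
[/CODE]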

1) Just to make sure that I understand:

[CODE]
__shared__ float first[256]; // this one is standalone
[/CODE]

but

[CODE]
extern __shared__ float second[];
extern __shared__ float third[];
[/CODE]

use the same memory address, and the available size is float(512 - 256) = float(256). So second[x] and third[x] refer to the same element, i.e. &third == &second. Is this right?

2) And if I do something like this:

[CODE]
__shared__ float first[256];      // standalone
extern __shared__ float second[];
extern __shared__ float third[];
__shared__ float forth[256];      // standalone
[/CODE]

will this work?


You would have to check the programming guide, but I think the size is float(512); but second == third, indeed.

As far as I know, yes.
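A quick way to convince yourself: the following test kernel (my own sketch, not from this thread) prints the addresses. Every extern __shared__ declaration starts at the same base address, while the statically sized arrays get their own storage. Note that device-side printf needs a card of compute capability 2.0 or later:

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void aliasTest()
{
    __shared__ float first[256];        // standalone static array
    extern __shared__ float second[];
    extern __shared__ float third[];    // same base address as second
    __shared__ float fourth[256];       // standalone static array

    if (threadIdx.x == 0)
        printf("first=%p second=%p third=%p fourth=%p\n",
               (void*)first, (void*)second, (void*)third, (void*)fourth);
}

int main()
{
    aliasTest<<<1, 32, 512 * sizeof(float)>>>();
    cudaDeviceSynchronize();
    return 0;
}
[/CODE]

second and third should print the same address; first and fourth should each be different.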