Example of Matrix Multiplication (from the CUDA book): points that I don't understand...


I need help; this is from the CUDA book:

[CODE]
// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
//   hA is the height of A
//   wA is the width of A
//   wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

// Device multiplication function called by Mul()
// Compute C = A * B
//   wA is the width of A
//   wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
[/CODE]
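For reference, here is a minimal sketch of how Mul() could be driven from a host program. The main() function, the matrix dimensions, and the fill values below are my own illustrative assumptions, not part of the guide's listing; it assumes the code above is in the same file and that all dimensions are multiples of BLOCK_SIZE:

[CODE]
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Hypothetical dimensions, all multiples of BLOCK_SIZE (16)
    const int hA = 32, wA = 48, wB = 64;

    float* A = (float*)malloc(hA * wA * sizeof(float));
    float* B = (float*)malloc(wA * wB * sizeof(float));
    float* C = (float*)malloc(hA * wB * sizeof(float));

    // Fill A and B with arbitrary test values
    for (int i = 0; i < hA * wA; ++i) A[i] = 1.0f;
    for (int i = 0; i < wA * wB; ++i) B[i] = 2.0f;

    Mul(A, B, hA, wA, wB, C);

    // Every element of C should equal wA * 1.0f * 2.0f = 96
    printf("C[0] = %f\n", C[0]);

    free(A); free(B); free(C);
    return 0;
}
[/CODE]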

I have a few questions.

1. In the function Muld, which runs on the device, we declare two shared-memory arrays:

[CODE]
// Shared memory for the sub-matrix of A
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

// Shared memory for the sub-matrix of B
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
[/CODE]

These declarations appear at the start of the loop body and go out of scope at its end.

a) Since the declarations are inside the loop, does this memory start at the same place (address) in shared memory for all the threads of the block, and is it released at the end of each iteration and allocated again when the loop restarts? In other words, does this memory begin and end within the same iteration, for all the threads?

b) From my understanding, and going by what the book says, the two arrays have the same address. Is that right?

Thanks, all!

I'm replying to all the posts.

Shared memory is the same memory for all threads within one block.

The arrays do not share the same memory; that is only true for extern shared memory. When using extern shared memory, you have to make sure that you split the single block of memory into the right number of parts you need. Its size is determined by the third parameter of the kernel call, like this:

[CODE]
myKernel<<<blocks, threads, SHARED_MEMORY_SIZE>>>(params);
[/CODE]
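To make the splitting concrete, here is a minimal sketch of partitioning one dynamic shared-memory block between two arrays. The kernel name, the sizes, and the element counts are my own assumptions for illustration:

[CODE]
#include <cuda_runtime.h>

// Hypothetical kernel: carves one extern shared block into two float arrays
__global__ void partitionKernel(int nFirst, int nSecond)
{
    extern __shared__ float smem[];     // one block of dynamic shared memory

    float* firstPart  = smem;           // first nFirst floats
    float* secondPart = smem + nFirst;  // next nSecond floats

    int t = threadIdx.x;
    if (t < nFirst)  firstPart[t]  = 1.0f;
    if (t < nSecond) secondPart[t] = 2.0f;
}

int main()
{
    // The third launch parameter must cover both parts
    partitionKernel<<<1, 256, (128 + 128) * sizeof(float)>>>(128, 128);
    cudaDeviceSynchronize();
    return 0;
}
[/CODE]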

So the maximum that I am using needs to satisfy:

the extern shared memory that I am declaring in the kernel
+ the shared memory that I am declaring in the kernel
= SHARED_MEMORY_SIZE from

[CODE]
myKernel<<<blocks, threads, SHARED_MEMORY_SIZE>>>(params);
[/CODE]

Am I on the right track?

As far as I know, no. You can have

[CODE]
__shared__ float first[256];
extern __shared__ float second[];
float* third;

third = &second[128];
[/CODE]

and call your kernel like

[CODE]
mykernel<<<grid, threads, 512 * sizeof(float)>>>(...);
[/CODE]

and you will have a first of size 256 and a second of size 512 (but you should only use the first 128 elements, otherwise you overwrite the contents of third), and your third will then be 384 elements big.

You have to check in your host code that 256 * sizeof(float), plus the dynamic shared memory you allocate from the host with the kernel call, is not too large for your grid and block dimensions. You can use the occupancy calculator beforehand to see what your limits are.
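Here is a sketch of that host-side check; the device query is the standard cudaGetDeviceProperties() runtime call, and the byte counts are the ones from the example above:

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Static shared memory declared in the kernel, plus the dynamic
    // amount passed at launch, must fit within the per-block limit
    size_t staticBytes  = 256 * sizeof(float);
    size_t dynamicBytes = 512 * sizeof(float);

    if (staticBytes + dynamicBytes > prop.sharedMemPerBlock)
        printf("Too much shared memory for this device!\n");
    else
        printf("OK: %zu of %zu bytes used.\n",
               staticBytes + dynamicBytes, (size_t)prop.sharedMemPerBlock);
    return 0;
}
[/CODE]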

1) Just to make sure that I understand:

[CODE]
__shared__ float first[256]; // this one is standalone
[/CODE]

but

[CODE]
extern __shared__ float second[];
extern __shared__ float third[];
[/CODE]

use the same memory address, and the available size is float(512 - 256) = float(256). So second[x] and third[x] refer to the same element, i.e. &third == &second. Is this right?

2) And if I do something like this:

[CODE]
__shared__ float first[256];      // standalone
extern __shared__ float second[];
extern __shared__ float third[];
__shared__ float forth[256];      // standalone
[/CODE]

will this work?


You would have to check the programming guide, but I think the size is float(512); but second == third, indeed.

As far as I know, yes.
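A quick way to convince yourself: the following test kernel (my own sketch, not from this thread) prints the addresses. Every extern __shared__ declaration starts at the same base address, while the statically sized arrays get their own storage. Note that device-side printf needs a card of compute capability 2.0 or later:

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void aliasTest()
{
    __shared__ float first[256];        // standalone static array
    extern __shared__ float second[];
    extern __shared__ float third[];    // same base address as second
    __shared__ float fourth[256];       // standalone static array

    if (threadIdx.x == 0)
        printf("first=%p second=%p third=%p fourth=%p\n",
               (void*)first, (void*)second, (void*)third, (void*)fourth);
}

int main()
{
    aliasTest<<<1, 32, 512 * sizeof(float)>>>();
    cudaDeviceSynchronize();
    return 0;
}
[/CODE]

second and third should print the same address; first and fourth should each be different.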