Looking for an easy shared memory example for a vector dot product

Is there an easy example out there of shared memory usage with a vector dot product?

How do I declare shared memory in a kernel, etc.?

I found some examples with so-called reduction, and I am trying to find an even easier one.

Please help!

an example like the following,

int idx = threadIdx.x + blockDim.x * blockIdx.x;
oB[idx] = iA[idx] * iA[idx];

How would I convert this so that I can learn something about shared memory?

Thanks in advance

First of all, you have to understand 2 important features of shared memory.
The scope of shared memory is block scope. That means only threads within a block can access the shared memory of their own block.
The size of shared memory is limited: 16 KB per multiprocessor (to keep it simple, we can say that one block has 16 KB of shared memory).

Back to your question.
int idx = threadIdx.x + blockDim.x * blockIdx.x;
oB[idx] = iA[idx] * iA[idx];
Here oB and iA are located in global memory, and you want to use shared memory.
To keep it easy to follow, let's assume you have 2 blocks of 64 threads each, and that oB and iA are integer arrays with 128 elements each.

First, we copy iA to shared memory:
__shared__ int smiA[64];
smiA[threadIdx.x] = iA[idx];
Then we do the calculation:
oB[idx] = smiA[threadIdx.x] * smiA[threadIdx.x];
In this situation, shared memory may not give any advantage. However, instead of loading data from iA twice (about 500 to 600 clock cycles each time), we load it only once and write it into shared memory. Then we use the data stored in shared memory for the calculation (a load from shared memory is as fast as a register if no bank conflict occurs).
Hope it's clear to you. :yes:
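Putting those pieces together, a minimal complete kernel might look like the sketch below (the kernel name squareKernel and the 2-block, 64-thread launch are just assumptions for illustration):

```cuda
// Sketch: stage iA through shared memory, then square each element.
// Assumes 2 blocks of 64 threads and arrays of exactly 128 elements.
__global__ void squareKernel(const int *iA, int *oB)
{
    __shared__ int smiA[64];        // one slot per thread in the block

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    smiA[threadIdx.x] = iA[idx];    // one global load per element
    __syncthreads();                // not strictly needed here (each thread only
                                    // reads its own slot), but required as soon
                                    // as threads read each other's data

    oB[idx] = smiA[threadIdx.x] * smiA[threadIdx.x];
}

// host-side launch:
// squareKernel<<<2, 64>>>(d_iA, d_oB);
```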

Hi,

I’m guessing you would also want to reduce these values?

__shared__ volatile float smem[64]; // padded to 2 * blockSize (32) so the unguarded reduction below never reads past the end

float val_in_reg = iA[idx];
val_in_reg *= val_in_reg;

smem[threadIdx.x] = val_in_reg;
smem[threadIdx.x + 32] = 0.0f; // zero the padding half

// reduce - assuming blockSize = 32: a single warp runs in lockstep, so no
// __syncthreads() is needed; volatile stops the compiler caching smem in registers

smem[threadIdx.x] += smem[threadIdx.x + 32];
smem[threadIdx.x] += smem[threadIdx.x + 16];
smem[threadIdx.x] += smem[threadIdx.x + 8];
smem[threadIdx.x] += smem[threadIdx.x + 4];
smem[threadIdx.x] += smem[threadIdx.x + 2];
smem[threadIdx.x] += smem[threadIdx.x + 1];

// output val is stored in smem[0]

if (threadIdx.x == 0)
    oB[idx] = smem[0];

Above is one way of writing it. But for this to be truly effective with only 32 threads you would want to do multiple loads over more elements of the vector.

something like:

float val = 0;
float reg_val = 0;
for (int i = 0; i < num_els_per_block / num_threads; i++)
{
  val = iA[threadIdx.x + blockIdx.x * num_els_per_block + i * num_threads];

  reg_val = reg_val + val * val;
}

smem[threadIdx.x] = reg_val;

// reduce again like above…
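Putting the multi-load loop and the warp reduction together, the whole kernel could be sketched as follows (the name dotKernel and the one-warp-per-block layout are assumptions; the shared array is padded to 64 floats so the unguarded reduction never reads out of bounds, and on Volta or newer hardware you would use __syncwarp() or warp shuffles instead of relying on lockstep execution):

```cuda
// Sketch: per-block partial dot product of iA with itself.
// One warp (32 threads) per block; the partial sum lands in oB[blockIdx.x].
__global__ void dotKernel(const float *iA, float *oB, int num_els_per_block)
{
    __shared__ volatile float smem[64];   // padded: upper half stays zero

    float reg_val = 0.0f;
    for (int i = 0; i < num_els_per_block / 32; i++)
    {
        float val = iA[threadIdx.x + blockIdx.x * num_els_per_block + i * 32];
        reg_val += val * val;
    }

    smem[threadIdx.x]      = reg_val;
    smem[threadIdx.x + 32] = 0.0f;        // zero padding for the first step

    // warp-synchronous reduction: all 32 threads run in lockstep (pre-Volta),
    // so no __syncthreads() is needed; volatile prevents register caching
    smem[threadIdx.x] += smem[threadIdx.x + 32];
    smem[threadIdx.x] += smem[threadIdx.x + 16];
    smem[threadIdx.x] += smem[threadIdx.x + 8];
    smem[threadIdx.x] += smem[threadIdx.x + 4];
    smem[threadIdx.x] += smem[threadIdx.x + 2];
    smem[threadIdx.x] += smem[threadIdx.x + 1];

    if (threadIdx.x == 0)
        oB[blockIdx.x] = smem[0];         // one partial sum per block
}
```

The host then sums the per-block partial results (or a second reduction kernel does).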

I finally got some advice. Thanks, Quoc Vinh.

Here are more questions, if you don't mind.

Now I understand what you mentioned:
“The scope of shared memory is block scope. That means only threads within a block can access the shared memory of their own block.”

And now I understand why the matrix multiplication example in the book, “Programming Massively Parallel Processors” by David B. Kirk & Wen-mei W. Hwu, defines a temporary 2D array for shared memory like

             __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];     << exactly the block size

because the following was defined in the host code

             dim3 dimBlock(TILE_WIDTH,TILE_WIDTH);           << defined in the host code

Back to my case: what if I define the following for the iA array in my host code:

             dim3 dimBlock(3, 3, 33);

and

int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;
int k = threadIdx.z;

to represent the 1D vector iA in terms of i, j, k, such as iA(i+j+k) << a conversion of a 3D array into 1D

In this case, how should I define the shared memory?

Since I already defined dimBlock(3, 3, 33), I thought I should define __shared__ int smiA[3][3][33].
Or is the form not important, as long as I allocate 3x3x33 = 297 elements, like __shared__ int smiA[9][33] or __shared__ int smiA[297]?

Thanks again in advance for your reply and valuable comments!

I have never tried a 3D array in shared memory before.
However, out of habit, I would define a 3D shared memory array as a 1D array:

dim3 block(3, 3, 33);

__shared__ int smiA[3 * 3 * 33];

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

int smIndex = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
int glIndex = x + y * (gridDim.x * blockDim.x) + z * (gridDim.x * blockDim.x) * (gridDim.y * blockDim.y);

The most important point here is that you must understand where smiA[smIndex] and iA[glIndex] land in your arrays, so you can access the data correctly.
By the way, if you try to optimize your program, you should pay attention to the number of threads per block.
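As a concrete sketch of that layout (the kernel name copy3DKernel and the copy-through pattern are illustrative assumptions; note the global index uses the grid-wide width and height, not blockDim alone):

```cuda
// Sketch: move 3D-indexed data through shared memory using linearized indices.
// Assumes a launch with dim3 block(3, 3, 33), i.e. 297 threads per block.
__global__ void copy3DKernel(const int *iA, int *oB)
{
    __shared__ int smiA[3 * 3 * 33];     // 297 ints, one per thread in the block

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    int width  = gridDim.x * blockDim.x; // grid-wide extents for the global index
    int height = gridDim.y * blockDim.y;

    int smIndex = threadIdx.x
                + threadIdx.y * blockDim.x
                + threadIdx.z * blockDim.x * blockDim.y;
    int glIndex = x + y * width + z * width * height;

    smiA[smIndex] = iA[glIndex];
    __syncthreads();                     // make the staged data visible to all threads
    oB[glIndex] = smiA[smIndex];
}
```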

Again, your comments are THE help line for a person like me, brand new to CUDA…

I am still learning what you mentioned, and I have another question about shared memory usage.

I want to do the following using shared memory:

main ()

{

Muld <<< >>> ( Ad, Bd, Cd, &res);

}

global void Muld(A, B, C, float *res)

{

shared As

shared Bs

As = A;

Bs = B;

(doing other stuff using As, Bs)

*res += iAs[idx] * iBs[idx] ;

}

The point is, I could have calculated ‘res’ in main, but I figured I could use As and Bs (which are already in memory) to speed up the whole process.

Is it possible to transfer ‘res’ between main() and Muld()?

Anyone’s comments are valuable to me.

Thanks in advance.

Please see the other post!

Where are Ad, Bd, Cd, and res located (global memory or host memory)?
What kind of variables are Ad, Bd, Cd, and res (scalars or arrays)?
The same question for As and Bs.

If you want to get advice from someone else, why did you write code that is very hard to understand? I cannot guess what you are trying to do.
Please make it clearer.

I made it way, way too simple. My apologies.

main()
{
float *res;

cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
cudaMalloc((void**) &Cd, size);

Muld <<< , >>> (Ad, Bd, Cd, &res);
print ‘res’ from host
}

__global__ void Muld(A, B, C, float *res)
{
__shared__ As
__shared__ Bs

As = A;
Bs = B;

(doing other stuff using As, Bs)

*res += iAs[idx] * iBs[idx];
}

Again, the point is that I could have calculated ‘res’ in main using A and B, but I figured I could use As and Bs, which are already in shared memory, to speed up the res calculation.

My question is the same.

Can I get ‘res’ back from Muld()?

Can the “res” variable be accessed by all threads in the grid?
I am confused about this point.
I think you should try atomic functions, or there is a good tutorial in the SDK called “Reduction”, if I remember correctly.
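For example, a sketch with atomicAdd (assuming res points to device memory that the host zeroes before the launch and copies back afterwards; float atomicAdd requires compute capability 2.0 or later; the kernel name and the 64-thread blocks are assumptions):

```cuda
// Sketch: dot product of iA and iB accumulated into a single device result.
__global__ void dotKernel(const float *iA, const float *iB, float *res)
{
    __shared__ float smem[64];           // assumes blockDim.x == 64

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    smem[threadIdx.x] = iA[idx] * iB[idx];
    __syncthreads();

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // one atomic add per block accumulates into the single result
    if (threadIdx.x == 0)
        atomicAdd(res, smem[0]);
}

// host side (sketch):
// float zero = 0.0f, res;
// cudaMemcpy(d_res, &zero, sizeof(float), cudaMemcpyHostToDevice);
// dotKernel<<<numBlocks, 64>>>(d_iA, d_iB, d_res);
// cudaMemcpy(&res, d_res, sizeof(float), cudaMemcpyDeviceToHost);
```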
