Can’t understand the shared memory concept! I want to talk live to somebody who knows it!

imperialguy · April 13, 2009, 3:38am

I have too many questions and complications regarding shared memory to write them all down here. If someone can help me over chat, I would be grateful. Please, please respond. My AIM/Yahoo chat ID is catsize7. I am really depressed by shared memory.

Mr_Nuke · April 13, 2009, 3:36pm

Hi, sorry I can’t go live right now, but I’ll try to put forth a brief explanation.

You can look at shared memory as the L1 cache in your GPU. The difference is that you have full control of it.

Let’s hope this example sheds some light:
You have n elements to process, and for each, you need to read p elements of s bytes each. If you were reading from global memory, you would need to read (n * p * s) bytes to process n elements. The global memory will choke your kernel to death.

So instead of reading p * s bytes from global memory for each element, you read them only once for the entire block (this works when the threads in a block need the same data). Let’s say you have a block size of 128. With shared memory you read only (n * p * s) / 128 bytes from global memory: p * s bytes per block rather than per thread. For the same memory bandwidth, you are reducing the transferred data by a factor of 128, roughly two orders of magnitude. Shared memory’s latency is also very small compared to global memory’s, so you have less latency to worry about once the data arrives.
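To make that concrete, here is a minimal CUDA sketch of the pattern, not a definitive implementation. It assumes every thread in a block needs the same p values; here they are the coefficients of a polynomial that each thread evaluates at its own x[i]. The names evalPoly, evalPolyOnGpu and MAX_P are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128   // threads per block, matching the block size in the example
#define MAX_P      256   // hypothetical upper bound on p so the shared array has a fixed size

// Each thread evaluates coeff[0] + coeff[1]*x + ... + coeff[p-1]*x^(p-1) at its own x[i].
// All threads need the same p coefficients, so the block stages them in shared memory once.
__global__ void evalPoly(const float *coeff, int p, const float *x, float *out, int n)
{
    __shared__ float sCoeff[MAX_P];

    // Cooperative load: the threads of the block stride over the p coefficients,
    // so each coefficient is read from global memory once per block.
    for (int j = threadIdx.x; j < p; j += blockDim.x)
        sCoeff[j] = coeff[j];
    __syncthreads();   // reached by every thread; the tile is complete after this

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];
        float acc = 0.0f;
        for (int j = p - 1; j >= 0; --j)   // Horner's rule, reading only shared memory
            acc = acc * xi + sCoeff[j];
        out[i] = acc;
    }
}

// Host helper (illustrative): returns 0 on success, -1 on bad arguments or any CUDA error.
int evalPolyOnGpu(const float *coeff, int p, const float *x, float *out, int n)
{
    if (p <= 0 || p > MAX_P || n <= 0) return -1;

    float *dCoeff = nullptr, *dX = nullptr, *dOut = nullptr;
    size_t coeffBytes = (size_t)p * sizeof(float);   // sizes are in bytes, not elements
    size_t dataBytes  = (size_t)n * sizeof(float);

    bool ok = cudaMalloc(&dCoeff, coeffBytes) == cudaSuccess
           && cudaMalloc(&dX, dataBytes) == cudaSuccess
           && cudaMalloc(&dOut, dataBytes) == cudaSuccess
           && cudaMemcpy(dCoeff, coeff, coeffBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dX, x, dataBytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        evalPoly<<<blocks, BLOCK_SIZE>>>(dCoeff, p, dX, dOut, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, dataBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe on the error paths too.
    cudaFree(dCoeff);
    cudaFree(dX);
    cudaFree(dOut);
    return ok ? 0 : -1;
}
```

Calling evalPolyOnGpu(coeff, 3, x, out, n) with coeff = {1, 2, 3} computes 1 + 2x + 3x^2 for each x[i]. The inner loop touches only shared memory; the coefficients cross the global memory bus once per block rather than once per thread.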

Shared memory does what UPS does. It takes the packages and puts thousands of them on one truck, instead of driving thousands of trucks. There’s only one exit, and each truck has to wait in line to leave. If you send thousands of trucks, you’ll wait a very long while for all of them to leave, but if you put all the packages in one truck, there’s almost no delay. Of course, there’s the cost of loading all the packages onto the one truck (reading from global to shared memory), but overall, the benefit is enormous.

Hope this helps,
Alex

imperialguy · April 13, 2009, 10:16pm

Hi,

Thanks for the reply. My main problem is with the practical implementation of shared memory, the coding aspect. I understood the theoretical part of it. I would like to discuss the following code, which aims at transposing a matrix using shared memory:

#include "stdafx.h"
#include <stdio.h>
#include <conio.h>
#include <cuda.h>

#define block_size_x 16
#define block_size_y 16
#define BLOCK_DIM 16

__global__ void transpose(float *out, float *in, int w, int h)
{
	__shared__ float block[BLOCK_DIM*BLOCK_DIM];

	//Block Index
	int bx=blockIdx.x;
	int by=blockIdx.y;

	//Thread Index
	int tx=threadIdx.x;
	int ty=threadIdx.y;

	//Row and Column of output
	int row=by*block_size_y+ty;
	int col=bx*block_size_x+tx;
	int block_index=ty*BLOCK_DIM+tx;
	int transpose_index=tx*BLOCK_DIM+ty;

	int col1,input_index,output_index,tem;

	for(int m=0;m<w/BLOCK_DIM;++m)
	{
		col1=m*BLOCK_DIM+tx;
		input_index=row*w+col1;
		block[block_index]=in[input_index];
		__syncthreads();

		out[input_index]=block[transpose_index];
	}
}

int main(int agrc, char* argv[])
{
	float *in,*out,*in_gpu,*out_gpu;
	int h=16,w=16,i,j;
	size_t size=h*w;
	FILE *ip,*op;

	ip=fopen("D:\input_matrix.txt","w+");
	op=fopen("D:\output_matrix.txt","w+");
	in=(float*)calloc(h*w,sizeof(float));
	out=(float*)calloc(h*w,sizeof(float));

	for(i=0;i<h;i++)
	{
		for(j=0;j<w;j++)
		{
			in[i*w+j]=rand()%10+1;
			fprintf(ip,"%.2f	",in[i*w+j]);
		}
		fprintf(ip,"\n");
	}

	cudaMalloc((void **) &in_gpu, size);
	cudaMalloc((void **) &out_gpu, size);
	cudaMemcpy(in_gpu, in, size, cudaMemcpyHostToDevice);

	// Define number of threads in a block i.e., block size (dimBlock) and number of blocks in a grid i.e., grid size (dimGrid)
	dim3 dimBlock(block_size_x,block_size_y);
	dim3 dimGrid(w/block_size_x,h/block_size_y);
	transpose<<<dimGrid, dimBlock>>>(out_gpu,in_gpu,w,h);
	cudaMemcpy(out, out_gpu, size, cudaMemcpyDeviceToHost);
	cudaFree(in_gpu);
	cudaFree(out_gpu);

	for(i=0;i<w;i++)
	{
		for(j=0;j<h;j++)
		{
			fprintf(op,"%.2f	",out[i*h+j]);
		}
		fprintf(op,"\n");
	}

	free(in);
	free(out);
	getche();
	return 0;
}

Could you tell me what is wrong with it? You can see what it produces by running it on your system. I would like to talk about this in detail over chat. Just tell me when you will be available online, or leave me your ID, or add me and let me know where you added me (Yahoo or AIM). Also, my MSN ID is catsize7@hotmail.com
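For comparison, here is a minimal sketch of a tiled transpose that stages each tile in shared memory. It is one common way to write it, not the only one, and the names TILE, transposeTiled and transposeOnGpu are illustrative. Two things differ from the posted code: the sizes passed to cudaMalloc and cudaMemcpy are in bytes (w * h * sizeof(float), not w * h), and the block coordinates are swapped when writing, so the sketch also works when the matrix spans more than one block.

```cuda
#include <cuda_runtime.h>

#define TILE 16   // tile edge; each block is TILE x TILE threads

// in is h rows by w columns (row-major); out is w rows by h columns.
__global__ void transposeTiled(float *out, const float *in, int w, int h)
{
    // One extra column of padding avoids bank conflicts on the column-wise read below.
    __shared__ float tile[TILE][TILE + 1];

    // Position of this thread's element in the input matrix.
    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    if (col < w && row < h)
        tile[threadIdx.y][threadIdx.x] = in[row * w + col];

    __syncthreads();   // the whole tile must be loaded before anyone reads it

    // Swap the block coordinates: tile (bx, by) of the input lands at (by, bx) in the output.
    int outCol = blockIdx.y * TILE + threadIdx.x;   // column in out, valid range [0, h)
    int outRow = blockIdx.x * TILE + threadIdx.y;   // row in out, valid range [0, w)
    if (outCol < h && outRow < w)
        out[outRow * h + outCol] = tile[threadIdx.x][threadIdx.y];
}

// Host helper (illustrative): returns 0 on success, -1 on bad arguments or any CUDA error.
int transposeOnGpu(float *out, const float *in, int w, int h)
{
    if (w <= 0 || h <= 0) return -1;

    size_t bytes = (size_t)w * h * sizeof(float);   // byte count, not element count
    float *dIn = nullptr, *dOut = nullptr;

    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 dimBlock(TILE, TILE);
        dim3 dimGrid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);   // round up to cover edges
        transposeTiled<<<dimGrid, dimBlock>>>(dOut, dIn, w, h);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe on the error paths too.
    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : -1;
}
```

After transposeOnGpu(out, in, w, h) succeeds, out[j * h + i] holds in[i * w + j] for every row i and column j. The bounds checks let w and h be any positive sizes, not just multiples of 16.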

