Cant understand Shared Memory Concept ! I want to talk Live to somebody who knows it !!&

I have too many questions and complications regarding shared memory. I cant just write it down here. If someone can help me via chatting I would be grateful to them. Please please respond. My aim/yahoo chat id is catsize7. I am really depressed with the shared memory.

Hi, sorry I can’t go live right now, but I’ll try to put forth a brief explanantion.

You can look at shared memory as the L1 cache in your GPU. The differemnce is that you have full control of it.

Let’s hope this example sheds some light:
You have n elements to process, and for each, you need to read p elements of s bytes each. If you were reading from global memory, you would need to read (n * p * s) bytes to process n elements. The global memory will choke your kernel to death.

So instead of reading ps from global memory for each element, you only read it once for the entire block. Let’s say you have a block size of 128. With shared memory you are only reading (nps)/128 bytes from global memory, ps bytes for each block, rather than each thread. For the same memory bandwidth, you are reducing the transferred data by over two orders of magnitude. The shared memory’s latency is very small compared to the global memory’s, so you also have less latency to worry about once the data arrives.

Shared memory does what UPS does. It takes the packages and puts thousands of them on one truck, instead of of driving thousands of trucks. There’s only one exit, and each truck has to wait in line to leave. If you put thousands of trucks, you’ll wait a very long while for all of them to leave, but if you put all packages in one truck, there’s almost no delay. Of course, there’s the cost of loading all the packages in the one truck (reading from global to shared mem), but overall, the benefit is enourmous.

Hope this helps,


thanks for the reply. my main problem is with the practical implementation of the shared memory - the coding aspect. i understood the theoretical part of it. I would like to discuss the following code which aims at transposing a matrix using shared memory:

#include "stdafx.h"

#include <stdio.h>

#include <conio.h>

#include <cuda.h>

#define block_size_x 16

#define block_size_y 16

#define BLOCK_DIM 16

__global__ void transpose(float *out, float *in, int w, int h)



	__shared__ float block[BLOCK_DIM*BLOCK_DIM];

	//Block Index

	int bx=blockIdx.x;

	int by=blockIdx.y;

	//Thread Index

	int tx=threadIdx.x;

	int ty=threadIdx.y;

	//Row and Column of output

	int row=by*block_size_y+ty;

	int col=bx*block_size_x+tx;

	int block_index=ty*BLOCK_DIM+tx;

	int transpose_index=tx*BLOCK_DIM+ty;


	int col1,input_index,output_index,tem;


	for(int m=0;m<w/BLOCK_DIM;++m)












int main(int agrc, char* argv[])


	float *in,*out,*in_gpu,*out_gpu;

	int h=16,w=16,i,j;

	size_t size=h*w;

	FILE *ip,*op;












			fprintf(ip,"%.2f	",in[i*w+j]);





	cudaMalloc((void **) &in_gpu, size);

	cudaMalloc((void **) &out_gpu, size);

	cudaMemcpy(in_gpu, in, size, cudaMemcpyHostToDevice);

	// Define number of threads in a block i.e., block size (dimBlock) and number of blocks in a grid i.e., grid size (dimGrid)

	dim3 dimBlock(block_size_x,block_size_y);

	dim3 dimGrid(w/block_size_x,h/block_size_y);

	transpose<<<dimGrid, dimBlock>>>(out_gpu,in_gpu,w,h);

	cudaMemcpy(out, out_gpu, size, cudaMemcpyDeviceToHost);








			fprintf(op,"%.2f	",out[i*h+j]);








	return 0;


Could you tell me what is wrong with it. You can see what it is producing by running it in you’re system. I would like to talk in detail about this on the chat. Just tell me when you will be available online, or leave me you’re ID or add me and let me know where did u add me - like yahoo/aim. Also, my MSN id is