Shared memory mechanism

Hello,

The following function darkens an image.

With shared memory:

__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
	// calculate texture coordinates
	float factor = 0.9999;
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	__shared__ float g_odatas;
	g_odatas = factor*tex2D(tex, x, y);
	for (int i=1; i<iterNum; i++){
		// keep multiplying the running value by the darkening factor
		g_odatas = factor*g_odatas;
	}
	g_odata[y*width + x] = g_odatas;
}

Without shared memory:

__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
	// calculate texture coordinates
	float factor = 0.9999;
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	g_odata[y*width + x] = factor*tex2D(tex, x, y);
	for (int i=1; i<iterNum; i++){
		// read back from global memory and darken again
		g_odata[y*width + x] = factor*g_odata[y*width + x];
	}
}

My question is: Why does this code work?

Fact: using shared memory accelerates this function dramatically.

I am unable to understand the shared memory mechanism.

Does every thread in the block redefine g_odatas? If so, how is it shared?

If not, why do my pictures still come out darkened correctly?

Anyway, where can I find a good (beginner/intermediate) article explaining how to work with shared memory in CUDA? (I am especially interested in image manipulation.)

Thanks in advance,

Aviad.

The one with shared memory cannot work like that. Are you sure that your image does not get distorted in blocks? You could make it work if you change your code to:

__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
	// calculate texture coordinates
	float factor = 0.9999;
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// one element per thread in the block, so threads no longer overwrite each other
	__shared__ float g_odatas[NUM_THREADS];
	unsigned int index = threadIdx.y * blockDim.x + threadIdx.x;  // unique index of this thread within the block

	g_odatas[index] = factor*tex2D(tex, x, y);
	for (int i=1; i<iterNum; i++)
		g_odatas[index] = factor*g_odatas[index];
	g_odata[y*width + x] = g_odatas[index];
}

But I would never use shared memory here, since you do not need to communicate between threads. So just use a register (take away the __shared__; the register version is sketched below, after the next kernel). Also, your code can probably be much faster when you do this:

__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
	// calculate texture coordinates
	float factor = 0.9999;
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// apply the factor iterNum times in a single multiply
	g_odata[y*width + x] = powf(factor, (float) iterNum) * tex2D(tex, x, y);
}
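And for completeness, the register version I mentioned: it is just the original loop kernel with the __shared__ qualifier removed, so each thread keeps its running value in a private local variable (normally a register):

__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
	// calculate texture coordinates
	float factor = 0.9999;
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	// plain local variable: private to each thread, no sharing and no races
	float value = factor*tex2D(tex, x, y);
	for (int i=1; i<iterNum; i++)
		value = factor*value;
	g_odata[y*width + x] = value;
}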

Hey,
Thanks for the quick and useful reply!

You are right about the usage of powf; however, this function isn’t the key part of my work. I just use it to test some ideas before I implement them in the real kernel.

However, if I understood you correctly, when I declare a shared memory array it is only defined once? Meaning that the first thread to execute is responsible for it?
In addition, when you wrote:

__shared__ float g_odatas[NUM_THREADS];

Does that mean NUM_THREADS is the total number of threads in the grid? Isn’t it supposed to be the number of threads per block?

Thanks!

You can also declare your shared array outside the kernel; it may be easier to understand it that way.

This is the number of threads per block.

But if I declare it outside the kernel, is it still shared per block? What is the difference between declaring the array outside the kernel and declaring it inside the kernel?

Thanks

There is no difference. The shared memory is shared between all the threads in a block (so in the above code that would be blockDim.x * blockDim.y threads). The only differences are:

  • when you declare it in your kernel, you need to know the size at compile time.

  • when you declare it outside your kernel (with extern) you can allocate it at runtime, but then you can only have one such array, so you need to carve pointers into that array when you need more than one shared array in your kernel. There is an example in the documentation, I believe. Both styles are sketched below.
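A minimal sketch of the two styles (the kernel names and the size of 256 are mine, just for illustration; assume one-dimensional blocks):

// Style 1: size fixed at compile time, array declared inside the kernel.
__global__ void staticSharedKernel(float* g_data)
{
	__shared__ float buf[256];                    // 256 floats per block, size known at compile time
	unsigned int tid = threadIdx.x;
	buf[tid] = g_data[blockIdx.x*blockDim.x + tid];
	__syncthreads();                              // wait until the whole block has written
	g_data[blockIdx.x*blockDim.x + tid] = buf[blockDim.x - 1 - tid];   // e.g. reverse within the block
}

// Style 2: size chosen at launch time, one extern array declared outside the kernel.
extern __shared__ float dynamicBuf[];

__global__ void dynamicSharedKernel(float* g_data)
{
	unsigned int tid = threadIdx.x;
	dynamicBuf[tid] = g_data[blockIdx.x*blockDim.x + tid];
	__syncthreads();
	g_data[blockIdx.x*blockDim.x + tid] = dynamicBuf[blockDim.x - 1 - tid];
}

// Host side: the third launch parameter is the dynamic shared memory size in bytes.
// staticSharedKernel<<<numBlocks, 256>>>(d_data);
// dynamicSharedKernel<<<numBlocks, threads, threads * sizeof(float)>>>(d_data);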

Hello,

I am new to CUDA too and I am a 3rd year SE student.

Correct me if I’m wrong:

Each shared memory array of size Z declared in the kernel will be allocated once (when is it allocated?), and the total size of the allocated memory will be Z * blockDim.x * blockDim.y.
If so, then each block of threads will share that memory of size Z.

Did I understand it right?

No, you did not understand it correctly.

If you declare __shared__ float dummy[2] in your kernel, then each block will have 2 floating-point elements shared. If you declared __shared__ float dummy[blockDim.x * blockDim.y], then you would have 1 floating-point element for each thread in the block. That would not work, by the way, since you need to know the size at compile time when declaring it in your kernel. It is ‘allocated’ before your block runs.

Read up on shared memory in the docs; they are quite extensive on the subject. Also, the SDK has many examples using shared memory in different ways.

I think that’s what I said, but maybe I am wrong.

For example:

The total number of blocks is gridDim.x * gridDim.y (let’s say 16x12).

Each block shares Z memory elements: let’s say I declared __shared__ float dummy[2] → Z = 2.

That sums up to a total of Z * gridDim.x * gridDim.y → 2 * 16 * 12 memory elements.

Am I still wrong?

(edit) blockDim => gridDim, my mistake

When thinking about shared memory, forget about your grid; you should consider only the block size.

So, if you have a 12x16 block of threads and you declare __shared__ float dummy[2] in the kernel body, then only one 2-element array will be allocated, and it will be shared among all 12x16 threads in the block.

This is exactly what DenisR has already said.
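A tiny sketch of that point (a hypothetical kernel, just to show that the whole block sees the same two elements):

__global__ void sharedPerBlock(float* g_odata, int width)
{
	__shared__ float dummy[2];                 // one 2-element array per block, not per thread

	// let a single thread of the block fill it in
	if (threadIdx.x == 0 && threadIdx.y == 0) {
		dummy[0] = 1.0f;
		dummy[1] = 2.0f;
	}
	__syncthreads();                           // make the writes visible to the whole block

	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
	g_odata[y*width + x] = dummy[0] + dummy[1];   // every thread in the block reads the same two values
}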

Thank you for the quick answer. I really appreciate it.

I now have a better understanding about the shared memory mechanism.

Although I still don’t understand something:

I was looking at projects using shared memory to get a better understanding, and in some of them I find this kind of shared memory allocation (the declaration looks something like this):
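extern __shared__ float shared[];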

The extern line is the interesting one.

What is the purpose of extern here? Dynamic allocation? What is the size of the memory allocated here?

The manual says that all memory allocated this way starts at the same address, but nothing is said about its size.

I am sorry for my lack of knowledge, but I am trying to understand. This parallel processing with shared memory is really tricky.

This memory is allocated outside of the kernel; its size is given, in bytes, as the third launch parameter. In the function that calls the kernel you will see the kernel launched like this:

bitonicSort<<<gridDim,blockDim,shared_memory_size>>>(…)
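The size is simply whatever number of bytes you pass as that third parameter; the extern declaration itself carries no size. A minimal sketch of the full pattern (a hypothetical kernel, not the SDK’s bitonicSort):

extern __shared__ float s_data[];              // size unknown at compile time

__global__ void scaleKernel(float* g_data, float factor)
{
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	s_data[threadIdx.x] = g_data[i];           // stage this block's elements in shared memory
	__syncthreads();
	g_data[i] = factor * s_data[threadIdx.x];
}

// Host side: each block gets threads * sizeof(float) bytes of shared memory,
// i.e. one float per thread:
// scaleKernel<<<blocks, threads, threads * sizeof(float)>>>(d_data, 0.5f);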