__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
    // calculate texture coordinates
    float factor = 0.9999;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    __shared__ float g_odatas;
    g_odatas = factor*tex2D(tex, x, y);
    for (int i=1; i<iterNum; i++){
        // repeatedly darken the value held in shared memory
        g_odatas = factor*g_odatas;
    }
    g_odata[y*width + x] = g_odatas;
}
Without shared memory:
__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
    // calculate texture coordinates
    float factor = 0.9999;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // read from texture and write to global memory
    g_odata[y*width + x] = factor*tex2D(tex, x, y);
    for (int i=1; i<iterNum; i++){
        // repeatedly darken the value in global memory
        g_odata[y*width + x] = factor*g_odata[y*width + x];
    }
}
My question is: why does this code work?
Fact: using shared memory accelerates this function dramatically.
I am unable to understand the shared memory mechanism.
Does every thread in the block redefine g_odatas? If so, then how is it shared?
If not, then why do my pictures still get darkened correctly?
Also, where can I find a good (beginner/intermediate) article explaining how to work with shared memory in CUDA? (I am especially interested in image manipulation.)
The one with shared memory cannot work like that. Are you sure your image does not get distorted in blocks? You could make it work if you change your code to:
__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
    // calculate texture coordinates
    float factor = 0.9999;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    __shared__ float g_odatas[NUM_THREADS];
    // one shared element per thread, indexed row-major within the block
    unsigned int index = threadIdx.y * blockDim.x + threadIdx.x;
    g_odatas[index] = factor*tex2D(tex, x, y);
    for (int i=1; i<iterNum; i++)
        g_odatas[index] = factor*g_odatas[index];
    g_odata[y*width + x] = g_odatas[index];
}
But I would never use shared memory here, since you do not need to communicate between threads. So just use a register (take away the __shared__). Also, your code can probably be much faster if you do this:
__global__ void
darkenImage( float* g_odata, int width, int height, int iterNum)
{
    // calculate texture coordinates
    float factor = 0.9999;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    g_odata[y*width + x] = powf(factor, (float) iterNum) * tex2D(tex, x, y);
}
You are right about the usage of powf; however, this function isn't the key of my work, I just use it to test some ideas before I implement them in the real kernel.
However, if I understood you correctly, when I declare a shared memory array, is it only defined once? Does that mean the first thread that executes is responsible for it?
In addition, when you wrote:
__shared__ float g_odatas[NUM_THREADS];
does it mean that NUM_THREADS is the total number of threads in the grid? Isn't it supposed to be the total number of threads per block?
But if I declare it outside the kernel, is it still shared per block? What is the difference between declaring the array outside the kernel and declaring it inside the kernel?
There is no difference. The shared memory is shared between all the threads in a block (so in the above code NUM_THREADS would be blockDim.x * blockDim.y). The only differences are:
when you declare it inside your kernel, you need to know the size at compile time;
when you declare it outside your kernel (as extern __shared__), you can allocate it at runtime, but you can only have one such array (so you need to make pointers into the array when you need more than one shared array in your kernel). There is an example in the documentation, I believe.
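The runtime-sized variant and the pointer-splitting trick mentioned above can be sketched like this (a sketch only; the kernel name, the "stage"/"scaled" split, and the 1-D launch are illustrative, not from the original code):

```cuda
// One extern declaration gives a single runtime-sized buffer per block;
// its size in bytes is passed as the third <<<...>>> launch parameter.
extern __shared__ float sbuf[];

__global__ void scaleWithShared(float* g_odata, int n, float factor)
{
    // Two logical shared arrays carved out of the one extern buffer
    float* stage  = sbuf;               // first blockDim.x floats
    float* scaled = sbuf + blockDim.x;  // next blockDim.x floats

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    if (i >= n) return;

    stage[tid]  = g_odata[i];           // each thread stages its own element
    scaled[tid] = factor * stage[tid];
    g_odata[i]  = scaled[tid];
}

// Host side: reserve room for both arrays at launch time, e.g.
// scaleWithShared<<<numBlocks, threads, 2 * threads * sizeof(float)>>>(d_data, n, 0.9999f);
```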
I am new to CUDA too and I am a 3rd-year SE student.
Correct me if I'm wrong:
Each shared memory array of size Z declared in the kernel will be allocated once (when is it allocated?) and the total size of the allocated memory will be Z * blockDim.x * blockDim.y.
If so, then each block of threads will share that memory of size Z.
If you declared __shared__ float dummy[2] in your kernel, then each block will have 2 floating-point elements shared. If you declared __shared__ float dummy[blockDim.x * blockDim.y], then you would have 1 floating-point element for each thread in the block. That would not work, by the way, since you need to know the size at compile time when declaring it inside your kernel. It is 'allocated' before your block runs.
Read up on shared memory in the docs, it is quite extensive on the subject. Also the SDK has many examples using shared memory in different ways.
When thinking about shared memory, forget about your grid; you should consider only the block size.
So, if you have a 12x16 block of threads and you declare __shared__ float dummy[2] in the kernel body, then only one 2-element array will be allocated per block, shared among all 12x16 threads in the block.
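To see that sharing concretely, here is a toy kernel (illustrative only; the kernel name and output layout are not from this thread) in which a single thread writes the 2-element array and every thread in the 12x16 block reads the same values back:

```cuda
__global__ void sharedDemo(float* g_out)
{
    __shared__ float dummy[2];   // one 2-element array per block, not per thread

    // Only one thread initializes; without this guard, every thread
    // in the block would race to write the same two locations.
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        dummy[0] = 1.0f;
        dummy[1] = 2.0f;
    }
    __syncthreads();             // make the writes visible to the whole block

    unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
    g_out[blockIdx.x * (blockDim.x * blockDim.y) + tid] = dummy[0] + dummy[1];
}

// Launched as sharedDemo<<<numBlocks, dim3(12, 16)>>>(d_out);
// every one of the 12*16 threads stores 3.0f, because they all
// see the same dummy[] array.
```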