Using shared memory: there's something I don't understand

Hello,

I need some help understanding something about shared memory use. The following code is a snippet from the Dr. Dobbs CUDA article series:

__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // each thread loads one element into shared memory, in reverse order
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    __syncthreads();

    // the block's slice is written out at the mirrored block position
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);

    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

and this is how the kernel is launched:

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

The size of d_b is quite big compared with the shared memory size.
What I understand is: the kernel reverses the order of the input array. In other words, it takes the data at each position of d_a, stores it in reverse order, and puts the result in d_b. My question is: how does it do this, given that it is achieved using a small array in shared memory, and it has to be done for every position in the array?

My other question is: what is "gridDim.x" for? I don't know what it is.

Any ideas?

Hi,

Here gridDim.x = dimGrid.x = numBlocks (the number of blocks in your grid).

So:
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;

s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
__syncthreads();

In your block you load part of d_a (the input, d_in) into shared memory in reverse order.

int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);

int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];

And you write it back to d_b (d_out) at the right place.
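
In case it helps to see everything in one place, here is a minimal host-side sketch of mine (not from the article; the sizes and variable names are my own choices). It allocates the arrays, launches the kernel above with one int of dynamic shared memory per thread, and checks one element of the result:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// reverseArrayBlock as defined above is assumed to be in the same file

int main()
{
    const int numThreadsPerBlock = 256;
    const int numBlocks = 4;                      // small sizes just for the demo
    const int n = numBlocks * numThreadsPerBlock; // must be an exact multiple

    int *h_a = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a, *d_b;
    cudaMalloc((void **)&d_a, n * sizeof(int));
    cudaMalloc((void **)&d_b, n * sizeof(int));
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);

    // third launch parameter = dynamic shared memory per block:
    // one int for each thread in the block
    size_t sharedMemSize = numThreadsPerBlock * sizeof(int);
    reverseArrayBlock<<<numBlocks, numThreadsPerBlock, sharedMemSize>>>(d_b, d_a);

    cudaMemcpy(h_a, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_a[0] = %d (expected %d)\n", h_a[0], n - 1);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    return 0;
}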

Hi, thanks for replying.

I almost understand this, but there's still something weird to me. When you say "part of d_a", we are talking about a chunk that is only the size of shared memory. But what happens to the other parts of d_a?

Can you please explain that to me? Maybe that's a basic concept, but I don't get it.

Thanks again.

Suppose we run this with 512 threads per block and d_a has 1,024,000 elements.

Then we would want dimGrid.x to be 2000 (1,024,000 / 512).

Each of those 2000 blocks will reverse the order of 512 elements; each block does 1/2000 of d_a, but together they do the whole array.

NB

int inOffset = blockDim.x * blockIdx.x;

int in = inOffset + threadIdx.x;

if blockDim.x is 512 and, say, blockIdx.x is 1000,

then inOffset will be 512000, which means block 1000's threads will be reading from d_in[512000] to d_in[512511].

And this line

s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

becomes

s_data[512 - 1 - 511] = d_in[512511];

i.e. s_data[0] = d_in[512511];

NB: the model of GPU used doesn't matter. Some may only be able to run a few blocks at once, others 240 blocks at once, but as soon as a block finishes the GPU will start the next block that needs doing, until all are done (and the entire array has been reversed).

Each block only sees the shared memory allocated to it. When a block finishes and a new one starts, the shared memory is reused, but the new block must treat its contents as uninitialized.

Also, for a real kernel we would want to be able to pass an array of any size to it. The simple kernel you are using is just an example; it needs a little more code to let, say, an array of 987 elements be reversed (or any size other than an exact multiple of the number of threads per block).
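
For what it's worth, here is one way that extra code could look. This is only a sketch of mine, not code from the article: each block still stages its slice in shared memory in reverse order, but a count of valid elements guards the loads and stores so a partially full last block also works (the kernel name and the extra parameter n are my own):

__global__ void reverseArrayAnySize(int *d_out, int *d_in, int n)
{
    extern __shared__ int s_data[];

    int start = blockDim.x * blockIdx.x;         // first element this block covers
    int count = min((int)blockDim.x, n - start); // how many of them actually exist

    // load this block's slice into shared memory in reverse order
    if (threadIdx.x < count)
        s_data[count - 1 - threadIdx.x] = d_in[start + threadIdx.x];

    __syncthreads();

    // the reversed slice lands at the mirrored position in the output
    int outOffset = n - start - count;
    if (threadIdx.x < count)
        d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];
}

It would be launched with numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock, with the shared memory size still numThreadsPerBlock * sizeof(int).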

Hope that helps

kbam

Hi kbam, it really helps!

I see what happens.

Thanks a lot!