The size of d_b is quite large compared with the size of shared memory.
What I understand is: the kernel reverses the order of the input array. In other words, it takes the elements of d_a, stores them in reverse order, and puts the result in d_b. My question is: how does it do this, given that it uses only a small array in shared memory and has to handle every position in the array?
Another question: what is “gridDim.x” used for? I don’t know what it is.
I almost understand this, but something still seems odd to me. When you say “part of d_b”, we are talking about a chunk of data the size of the shared memory array. But what happens to the other parts of d_b?
Can you please explain that to me? Maybe that’s a basic concept, but I don’t get it.
Suppose we run this with 512 threads per block and d_a has 1,024,000 elements.
Then we would want dimGrid.x to be 2000 (1,024,000 / 512). The grid size you launch with is what the kernel sees as gridDim.x.
Each of those 2000 blocks reverses the order of 512 elements. Each block handles 1/2000 of d_a, but together they cover the whole array.
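To make the arithmetic concrete, here is a small host-side sketch of how the launch numbers could be worked out. LaunchConfig and configFor are hypothetical names made up for illustration, and the sketch assumes the element count is an exact multiple of the threads per block.

```cuda
#include <cstddef>

// Hypothetical helper (not from the thread): launch parameters for a kernel
// that handles one int per thread and stages one int per thread in shared memory.
struct LaunchConfig {
    int numBlocks;               // grid size; the kernel sees it as gridDim.x
    std::size_t sharedMemBytes;  // dynamic shared memory needed by EACH block
};

// Assumes numElements is an exact multiple of threadsPerBlock.
inline LaunchConfig configFor(int numElements, int threadsPerBlock) {
    LaunchConfig cfg;
    cfg.numBlocks = numElements / threadsPerBlock;
    cfg.sharedMemBytes = static_cast<std::size_t>(threadsPerBlock) * sizeof(int);
    return cfg;
}
```

For the numbers above, configFor(1024000, 512) gives 2000 blocks, and each block needs only 512 * sizeof(int) bytes of shared memory. That is why the shared array can be far smaller than d_b: it only ever holds one block's chunk.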
NB
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
If blockDim.x is 512 and, say, blockIdx.x is 1000,
then inOffset will be 512000, which means block 1000’s threads will read from d_in[512000] to d_in[512511].
And this line
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
becomes, for the last thread of that block (threadIdx.x = 511),
s_data[512 - 1 - 511] = d_in[512511];
i.e. s_data[0] = d_in[512511];
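Putting the quoted lines into context, a complete kernel might look roughly like the sketch below. The thread only quotes three lines of it, so everything else is an assumption: the kernel name reverseArrayBlock, the d_out parameter, the write-back step using gridDim.x, and the host wrapper reverseOnDevice. It also assumes the array length is an exact multiple of the block size.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Sketch of a block-wise reversal kernel; reverseArrayBlock and d_out are assumed names.
// Assumes the array length is exactly gridDim.x * blockDim.x.
__global__ void reverseArrayBlock(int *d_out, const int *d_in) {
    extern __shared__ int s_data[];  // blockDim.x ints, private to this block

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Each thread loads one element and stores it reversed within the block.
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Wait until every thread in the block has written its element.
    __syncthreads();

    // gridDim.x is the number of blocks in the grid: block 0 writes to the
    // last chunk of the output, block 1 to the second-to-last, and so on.
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<int> reverseOnDevice(const std::vector<int> &h_in, int threadsPerBlock) {
    int n = static_cast<int>(h_in.size());
    std::size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int numBlocks = n / threadsPerBlock;  // assumes an exact multiple
    std::size_t sharedBytes = threadsPerBlock * sizeof(int);
    reverseArrayBlock<<<numBlocks, threadsPerBlock, sharedBytes>>>(d_out, d_in);

    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

So the reversal happens in two steps: elements are reversed inside each block through shared memory, and the blocks themselves are written out in reverse order through outOffset. With 8 elements and 4 threads per block, block 0 reads positions 0 to 3 and writes them reversed into positions 4 to 7, while block 1 reads 4 to 7 and writes them reversed into 0 to 3.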
NB The model of GPU used doesn’t matter. Some may only be able to run a few blocks at once, others 240 blocks at once, but once a block finishes the GPU starts the next block that needs doing, until all are done (and the entire array has been reversed).
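Each block only sees the shared memory allocated to it. When a block finishes and a new one starts, the shared memory is reused, but it is not guaranteed to be cleared, so each block should treat its shared memory as uninitialized until it has written to it.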
Also, for a real kernel we would want to be able to pass an array of any size. The simple kernel you are using is just an example; it needs a little more code to reverse an array of, say, 987 elements (or any size other than an exact multiple of the number of threads per block).
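One possible way to add that extra code is sketched below. This is only an illustration, not the original example: the names reverseArrayAnySize and reverseAnySizeOnDevice are made up, the element count n is passed as an extra argument, and the grid size is rounded up so the last block handles a shorter chunk.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Sketch: block-wise reversal for any n >= 1; name and signature are assumptions.
__global__ void reverseArrayAnySize(int *d_out, const int *d_in, int n) {
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    // Number of valid elements in this block's chunk (smaller for the last block).
    int count = min((int)blockDim.x, n - inOffset);
    int t = threadIdx.x;

    if (t < count) {
        s_data[count - 1 - t] = d_in[inOffset + t];
    }
    // Outside the if: every thread in the block must reach the barrier.
    __syncthreads();

    // The chunk [inOffset, inOffset + count) lands at [n - inOffset - count, n - inOffset).
    int outOffset = n - inOffset - count;
    if (t < count) {
        d_out[outOffset + t] = s_data[t];
    }
}

// Hypothetical host wrapper; error checking omitted for brevity.
std::vector<int> reverseAnySizeOnDevice(const std::vector<int> &h_in, int threadsPerBlock) {
    int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    if (n == 0) return h_out;  // nothing to launch

    std::size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Round up so a partial final chunk still gets a block.
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    std::size_t sharedBytes = threadsPerBlock * sizeof(int);
    reverseArrayAnySize<<<numBlocks, threadsPerBlock, sharedBytes>>>(d_out, d_in, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With 987 elements and 256 threads per block this launches 4 blocks. The last block has only 219 valid elements, so its threads 219 to 255 skip the load and store but still take part in __syncthreads().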