Dealing with a large shared memory array

Hello,

I am using shared memory in my program:

__shared__ int myshared[tile_width][tile_width][N];


The first two dimensions hold the input data for one image, and the third (N) is the number of images to create.

The problem is that N may be up to 2000, so the array exceeds the size of shared memory.

How can I deal with this?
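
To put numbers on the problem: assuming a 16 x 16 tile (which matches the 256 threads per block mentioned further down) and `int` elements, the array would need about 2 MB for N = 2000, while a block typically has only a few tens of kilobytes of shared memory. A minimal sketch of that arithmetic follows. `sharedBytesNeeded` and `sharedBytesPerBlock` are hypothetical helper names; `cudaGetDeviceProperties` and the `sharedMemPerBlock` field come from the CUDA runtime API.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Assumption: 16 x 16 tiles (256 threads per block) holding int pixels.
constexpr int tile_width = 16;

// Bytes required by a static array myshared[tile_width][tile_width][n].
constexpr std::size_t sharedBytesNeeded(int n) {
    return sizeof(int) * tile_width * tile_width * static_cast<std::size_t>(n);
}

// Shared memory available to a single block on the given device,
// or 0 if the device properties cannot be read.
std::size_t sharedBytesPerBlock(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}
```

Comparing `sharedBytesNeeded(N)` with `sharedBytesPerBlock(0)` before launching shows how many images can share a block's memory at once; with these assumed sizes one image tile takes 1 KB.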

Note that I am using:

dim3 dimGrid(  (Cols / tile_width)  ,  (Rows / tile_width) , N );
dim3 dimBlock( tile_width , tile_width);

As far as I know, I don’t have any control over how the work is distributed among the MPs. My card has 14 MPs.

Another problem is that N (the number of images) is defined in a .cpp file, which calls an extern function defined in a .cu file.

N is then passed as an argument to the kernel. How can I use N in my shared variable? Right now I am not using N as the third dimension; I am writing the number of images directly.
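
A statically declared `__shared__` array needs compile-time dimensions, so a kernel argument cannot size it. One option is dynamically sized shared memory, where the byte count is given as the third launch parameter. The sketch below is only an illustration under assumptions: the kernel `sumAcrossImages`, the wrapper `launchSumAcrossImages`, the helper `sharedBytesFor`, the 1D layout, and the summing step are all hypothetical and not taken from the code in this thread. The capacity limit still applies, so the wrapper refuses requests that do not fit.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Unsized declaration: the size in bytes is supplied at launch time.
extern __shared__ int myshared[];

// Hypothetical kernel: each thread stages its pixel from n images in
// shared memory, then sums the staged values. 1D grid and blocks for brevity.
__global__ void sumAcrossImages(const int* dev_input, int* dev_output,
                                int pixels, int n)
{
    int t = threadIdx.x;
    int p = blockIdx.x * blockDim.x + t;   // pixel handled by this thread

    for (int k = 0; k < n; ++k)
        myshared[k * blockDim.x + t] = (p < pixels) ? dev_input[k * pixels + p] : 0;
    __syncthreads();

    if (p < pixels) {
        int sum = 0;
        for (int k = 0; k < n; ++k)
            sum += myshared[k * blockDim.x + t];
        dev_output[p] = sum;
    }
}

// Shared memory bytes needed for `threads` threads staging n ints each.
constexpr std::size_t sharedBytesFor(int threads, int n) {
    return sizeof(int) * static_cast<std::size_t>(threads) * static_cast<std::size_t>(n);
}

// Callable from a .cpp file; n is an ordinary runtime value there.
// dev_input and dev_output are device pointers.
// Returns 0 on success, -1 if n images do not fit, -2 on a CUDA error.
extern "C" int launchSumAcrossImages(const int* dev_input, int* dev_output,
                                     int pixels, int n)
{
    const int threads = 256;
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -2;

    std::size_t bytes = sharedBytesFor(threads, n);
    if (bytes > prop.sharedMemPerBlock) return -1;

    int blocks = (pixels + threads - 1) / threads;
    sumAcrossImages<<<blocks, threads, bytes>>>(dev_input, dev_output, pixels, n);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : -2;
}
```

With 256 threads per block, ten images need 10 KB and would usually fit, whereas 2000 images need about 2 MB and would be rejected by the size check.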

Thank you!

Re-use shared memory. Ultimately, anything that shows up in shared memory has to be copied there in the first place. So organize your code so that it deals with one image at a time (instead of N), and copy each image as you need it.

Can you give me an example please?

I can’t understand how to use one image at a time.
Also, because I am passing the number of images as an argument to the kernel, I would like to use it.

host code:

dim3 dimGrid(  (Cols / tile_width)  ,  (Rows / tile_width) , 1);

kernel code:

__shared__ int myshared[tile_width][tile_width];

for (int i = 0; i < N; i++){
  // copy image i into myshared, synchronize, then process it
}

Hmm, OK with that, I am just not sure now how to use each image.
I mean the index of each.

Because right now I am using:

int bx = blockIdx.x , by = blockIdx.y , bz = blockIdx.z;
int tx = threadIdx.x , ty = threadIdx.y , tz = threadIdx.z;

int RowIdx = ty + by * tile_width;
int ColIdx = tx + bx * tile_width;
int NIdx = tz + bz * blockDim.z;

and then:

J = RowIdx * Cols + ColIdx + Rows * Cols * NIdx;
myshared[ty][tx][NIdx] = *( dev_input + J );

Now, you say just use:

for (int i = 0; i < N; i++)
  myshared[ty][tx] = *( dev_input + J );

I am not sure how to implement the “i” index…

__shared__ int myshared[tile_width][tile_width];

int bx = blockIdx.x , by = blockIdx.y;
int tx = threadIdx.x , ty = threadIdx.y;

int RowIdx = ty + by * tile_width;
int ColIdx = tx + bx * tile_width;

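for (int i = 0; i < N; i++){
  int NIdx = i;
  int J = RowIdx * Cols + ColIdx + Rows * Cols * NIdx;
  myshared[ty][tx] = *( dev_input + J );
  __syncthreads();  // the whole tile of image i is now in shared memory
  // ... process image i here ...
  __syncthreads();  // finish with the tile before the next iteration overwrites it
}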

OK! Thanks a lot!
I will check it and write back!

PS: Is there no other way to do it using my 3D example?
For example, load as much data as fits in shared memory onto the MPs and then, when the MPs are finished, continue with the next batch.
Or use dynamically allocated shared memory? I don’t know…

Well I don’t know how else to do it. Fundamentally, shared memory is “kind of” small. If you’re going to use it, you’ll have to figure out how to fit things in it. Based on the fact that you have your data naturally segmented into “images” it seemed reasonable to break it up that way. I’m sure there are other approaches, but I don’t have other ideas.

OK, thank you very much for your help.

One last thing.

With this for loop, I still have each thread processing one pixel as in my example, right?

And for one image, I am launching 20 blocks

( dim3 dimGrid(  (Cols / tile_width)  ,  (Rows / tile_width)); )

each with 256 threads.

Now that the code has the for loop to process each image, and let’s say I am processing 10 images, do I still use 20 blocks per image and 256 threads per block?

I mean, for 10 images do I have to launch 10 * 20 = 200 blocks?
Or do the blocks remain 20?

My card has 14 MPs and up to 8 blocks/MP, so 8 * 14 = 112 blocks at the same time.
Now I have only 20 blocks at the same time?

Is there something I must check in order to achieve the best occupancy?
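
The block counts in this question can be worked out directly. With the loop over images inside the kernel, the grid stays at 20 blocks no matter how many images there are; with N as the third grid dimension it becomes 20 * N. The sketch below assumes Rows = 64, Cols = 80 and tile_width = 16, chosen only because they reproduce the 20 blocks quoted above. `totalBlocks`, `residentBlockSlots`, `residentBlocksOnDevice` and `placeholderKernel` are hypothetical names; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the runtime call that reports how many blocks of a given kernel can be resident on one multiprocessor.

```cuda
#include <cuda_runtime.h>

constexpr int tile_width = 16;  // assumption: 16 x 16 = 256 threads per block

// Blocks launched for dimGrid(Cols / tile, Rows / tile, depth).
constexpr int totalBlocks(int Rows, int Cols, int tile, int depth) {
    return (Cols / tile) * (Rows / tile) * depth;
}

// Upper bound on blocks that can be resident on the whole GPU at once.
constexpr int residentBlockSlots(int multiprocessors, int blocksPerMP) {
    return multiprocessors * blocksPerMP;
}

// Placeholder kernel with the same block shape and shared tile as in the
// thread; it transposes one 16 x 16 tile in place and only exists so the
// occupancy query has a kernel to inspect.
__global__ void placeholderKernel(int* data) {
    __shared__ int tile[tile_width][tile_width];
    int idx = threadIdx.y * tile_width + threadIdx.x;
    tile[threadIdx.y][threadIdx.x] = data[idx];
    __syncthreads();
    data[idx] = tile[threadIdx.x][threadIdx.y];
}

// Resident blocks of placeholderKernel across the device, or -1 on error.
int residentBlocksOnDevice(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    int perMP = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &perMP, placeholderKernel, tile_width * tile_width, 0) != cudaSuccess)
        return -1;
    return perMP * prop.multiProcessorCount;
}
```

Under these assumptions a 20-block grid fills fewer than a fifth of the 112 resident slots, while the 200 blocks of the N = 10 grid exceed them and are scheduled in waves.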

Hmm, I figured out that in your example code I still have to use the third dimension of the grid to hold the number of images.

dim3 dimGrid(  (Cols / tile_width)  ,  (Rows / tile_width) , N );

Otherwise, I am getting wrong results.

So what about the blocks and MPs in this case?
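
Keeping N in the third grid dimension does not require a third dimension in shared memory, because shared memory is allocated per block and each block only ever touches one tile of one image. The following is a rough sketch of that variant, not the poster's actual kernel: `perImageTile`, `launchPerImageTile` and `flatIndex` are hypothetical names, the in-tile transpose is a stand-in for the real processing, and tile_width = 16 is assumed.

```cuda
#include <cuda_runtime.h>

constexpr int tile_width = 16;  // assumption; must be a compile-time constant

// Flat offset of pixel (row, col) of the given image, as used in the thread:
// J = RowIdx * Cols + ColIdx + Rows * Cols * NIdx.
__host__ __device__ inline int flatIndex(int row, int col, int image,
                                         int Rows, int Cols) {
    return row * Cols + col + Rows * Cols * image;
}

// One block handles one tile of one image; blockIdx.z selects the image.
__global__ void perImageTile(const int* dev_input, int* dev_output,
                             int Rows, int Cols)
{
    // 1 KB per block, independent of the number of images.
    __shared__ int myshared[tile_width][tile_width];

    int tx = threadIdx.x, ty = threadIdx.y;
    int RowIdx = ty + blockIdx.y * tile_width;
    int ColIdx = tx + blockIdx.x * tile_width;
    int NIdx   = blockIdx.z;
    int J = flatIndex(RowIdx, ColIdx, NIdx, Rows, Cols);

    myshared[ty][tx] = dev_input[J];
    __syncthreads();  // the whole tile is loaded before anyone reads it

    // Stand-in processing: transpose the tile in place within the image.
    dev_output[J] = myshared[tx][ty];
}

// dev_input and dev_output are device pointers holding N images of Rows x Cols.
// Returns 0 on success, -1 if the image size is not a multiple of the tile,
// -2 on a CUDA error.
extern "C" int launchPerImageTile(const int* dev_input, int* dev_output,
                                  int Rows, int Cols, int N)
{
    if (Rows % tile_width != 0 || Cols % tile_width != 0) return -1;
    dim3 dimGrid(Cols / tile_width, Rows / tile_width, N);
    dim3 dimBlock(tile_width, tile_width);
    perImageTile<<<dimGrid, dimBlock>>>(dev_input, dev_output, Rows, Cols);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : -2;
}
```

In this layout the grid holds 20 * N blocks for the 20-blocks-per-image case, and the hardware scheduler distributes them over the MPs as slots free up.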

I think I’ve actually posted more lines of code into this thread than you have. Yes, the “process_image” code will have to be modified to reflect that the images are being processed sequentially rather than by a larger grid of threads. You haven’t shown any of that code so I’m not sure how I could comment on that. I’m sure you can figure it out.