As far as I know, I don’t have any control over how the work is distributed among the MPs. My card has 14 MPs.
Another problem I have is that N (the number of images) is defined in a .cpp file, which calls an extern function that is defined in a .cu file.
N is also an argument to the kernel function. How can I use N to size my shared variable? Right now I am not using N as the third dimension; I am writing the number of images in directly.
Re-use shared memory. Ultimately, anything that shows up in shared memory has to be copied there in the first place. So organize your code so you can deal with one image at a time (instead of N) and copy each image as you need it.
I can’t understand how to use one image at a time.
Also, because I am passing the number of images as an argument to the kernel, I would like to use it.
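__shared__ int myshared[tile_width][tile_width];
int bx = blockIdx.x, by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;
int RowIdx = ty + by * tile_width;
int ColIdx = tx + bx * tile_width;
// One image per iteration, re-using the same shared tile each time.
for (int i = 0; i < N; i++) {
    // Offset of this thread's pixel inside image i.
    int J = RowIdx * Cols + ColIdx + Rows * Cols * i;
    myshared[ty][tx] = dev_input[J];
    __syncthreads();   // the whole tile is loaded before anyone reads it
    process_image(i);
    __syncthreads();   // everyone is done before the tile is overwritten
}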
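For reference, here is a minimal complete sketch of the one-image-at-a-time idea, with N passed as an ordinary kernel argument. Everything beyond the names in the snippet above is an assumption for illustration: `TILE_WIDTH` is fixed at 16, Rows and Cols are assumed to be multiples of it, and since the real `process_image` was never posted, a made-up step stands in for it (each thread adds the tile entry mirrored across its row, so the tile really is read by other threads).

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 16  // assumed tile size: 16 x 16 = 256 threads per block

// Hypothetical stand-in for process_image. Assumes Rows and Cols are
// multiples of TILE_WIDTH, so every thread maps to a valid pixel.
__global__ void sum_images(const int* dev_input, int* dev_output,
                           int Rows, int Cols, int N)
{
    __shared__ int tile[TILE_WIDTH][TILE_WIDTH];  // size does not depend on N
    int tx = threadIdx.x, ty = threadIdx.y;
    int RowIdx = ty + blockIdx.y * TILE_WIDTH;
    int ColIdx = tx + blockIdx.x * TILE_WIDTH;
    int acc = 0;
    for (int i = 0; i < N; i++) {
        // Copy this block's tile of image i into shared memory.
        tile[ty][tx] = dev_input[RowIdx * Cols + ColIdx + Rows * Cols * i];
        __syncthreads();                       // tile fully loaded
        acc += tile[ty][TILE_WIDTH - 1 - tx];  // read a neighbour's entry
        __syncthreads();                       // safe to overwrite the tile
    }
    dev_output[RowIdx * Cols + ColIdx] = acc;
}

// Host-side helper (also hypothetical): copies the N images to the device,
// launches one block per tile of a single image, and returns the result.
std::vector<int> run_sum_images(const std::vector<int>& input,
                                int Rows, int Cols, int N)
{
    int *d_in = nullptr, *d_out = nullptr;
    size_t in_bytes = input.size() * sizeof(int);
    size_t out_bytes = (size_t)Rows * Cols * sizeof(int);
    cudaMalloc(&d_in, in_bytes);
    cudaMalloc(&d_out, out_bytes);
    cudaMemcpy(d_in, input.data(), in_bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(Cols / TILE_WIDTH, Rows / TILE_WIDTH);  // covers ONE image
    sum_images<<<grid, block>>>(d_in, d_out, Rows, Cols, N);

    std::vector<int> out((size_t)Rows * Cols);
    cudaMemcpy(out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Note that N only controls the trip count of the loop. The shared array stays one tile in size however many images there are, which is why N does not need to be known at compile time.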
OK! Thanks a lot!
I will check it and write back!
PS: Is there no other way to do it, using my 3D example?
For example, load as many images as fit in shared memory onto the MPs and then, when the MPs are finished, continue with the next batch of images.
Or use dynamically allocated shared memory? I don’t know…
Well, I don’t know how else to do it. Fundamentally, shared memory is “kind of” small. If you’re going to use it, you’ll have to figure out how to fit things in it. Based on the fact that you have your data naturally segmented into “images” it seemed reasonable to break it up that way. I’m sure there are other approaches, but I don’t have other ideas.
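For what it's worth, the dynamically allocated shared memory idea raised above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `batch` parameter (how many image tiles a block holds at once), the summing step, and the helper name are all made up here, and Rows and Cols are assumed to be multiples of the tile size. The shared array is declared `extern` and its size in bytes is given as the third launch parameter, so it can depend on a runtime value.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block holds `batch` tiles at a time (hypothetical parameter) and
// walks through the N images batch by batch, summing every pixel over N.
__global__ void sum_images_batched(const int* dev_input, int* dev_output,
                                   int Rows, int Cols, int N, int batch)
{
    extern __shared__ int tiles[];  // batch * blockDim.y * blockDim.x ints
    int tx = threadIdx.x, ty = threadIdx.y;
    int RowIdx = ty + blockIdx.y * blockDim.y;
    int ColIdx = tx + blockIdx.x * blockDim.x;
    int tile_elems = blockDim.x * blockDim.y;
    int local = ty * blockDim.x + tx;  // this thread's slot inside one tile
    int acc = 0;
    for (int first = 0; first < N; first += batch) {
        int count = min(batch, N - first);  // the last batch may be short
        for (int k = 0; k < count; k++)
            tiles[k * tile_elems + local] =
                dev_input[RowIdx * Cols + ColIdx + Rows * Cols * (first + k)];
        __syncthreads();
        for (int k = 0; k < count; k++)
            acc += tiles[k * tile_elems + local];
        __syncthreads();  // finish reading before the next batch is loaded
    }
    dev_output[RowIdx * Cols + ColIdx] = acc;
}

// Hypothetical host helper. Returns an empty vector if the requested
// batch does not fit in the shared memory available to one block.
std::vector<int> run_sum_images_batched(const std::vector<int>& input, int Rows,
                                        int Cols, int N, int tile, int batch)
{
    cudaDeviceProp prop = {};
    cudaGetDeviceProperties(&prop, 0);
    size_t smem_bytes = (size_t)batch * tile * tile * sizeof(int);
    if (smem_bytes > prop.sharedMemPerBlock) return std::vector<int>();

    int *d_in = nullptr, *d_out = nullptr;
    size_t in_bytes = input.size() * sizeof(int);
    size_t out_bytes = (size_t)Rows * Cols * sizeof(int);
    cudaMalloc(&d_in, in_bytes);
    cudaMalloc(&d_out, out_bytes);
    cudaMemcpy(d_in, input.data(), in_bytes, cudaMemcpyHostToDevice);

    dim3 block(tile, tile);
    dim3 grid(Cols / tile, Rows / tile);
    // Third launch parameter: bytes of dynamic shared memory per block.
    sum_images_batched<<<grid, block, smem_bytes>>>(d_in, d_out,
                                                    Rows, Cols, N, batch);

    std::vector<int> out((size_t)Rows * Cols);
    cudaMemcpy(out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

A larger `batch` means more shared memory per block, which can reduce how many blocks fit on an MP at once, so it is a trade-off rather than a free win.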
Now that I have the for loop in the code to process each image, and let’s say I am processing 10 images, do I still use 20 blocks per image and 256 threads per block?
I mean, for 10 images do I have to launch 10 * 20 = 200 blocks?
Or does the number of blocks remain 20?
My card has 14 MPs and up to 8 blocks per MP, so 8 * 14 = 112 blocks can run at the same time.
Do I now have only 20 blocks running at the same time?
Is there something I must check in order to achieve the best occupancy?
I think I’ve actually posted more lines of code into this thread than you have. Yes, the “process_image” code will have to be modified to reflect that the images are being processed sequentially rather than by a larger grid of threads. You haven’t shown any of that code so I’m not sure how I could comment on that. I’m sure you can figure it out.
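On the block-count question, the arithmetic can be sketched as follows. With the loop over images inside the kernel, the grid only has to cover one image, so the count stays at 20 blocks; a grid that also spans the images would need 20 * N blocks. The image size used in the usage note (64 x 80 with 16 x 16 tiles) is made up so that it gives 20 blocks, and `loop_over_images`, `LaunchInfo` and `describe_launch` are hypothetical names. The sketch also assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that loops over the N images internally.
__global__ void loop_over_images(const int* dev_input, int* dev_output,
                                 int Rows, int Cols, int N)
{
    int RowIdx = threadIdx.y + blockIdx.y * blockDim.y;
    int ColIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (RowIdx >= Rows || ColIdx >= Cols) return;
    int acc = 0;
    for (int i = 0; i < N; i++)
        acc += dev_input[RowIdx * Cols + ColIdx + Rows * Cols * i];
    dev_output[RowIdx * Cols + ColIdx] = acc;
}

struct LaunchInfo {
    int blocks_per_image;   // grid size when the kernel loops over images
    int blocks_all_images;  // grid size if the grid also spanned the images
    int resident_capacity;  // blocks the whole device can hold at once
};

// Assumes Rows and Cols are multiples of `tile`.
LaunchInfo describe_launch(int Rows, int Cols, int tile, int N)
{
    LaunchInfo info;
    info.blocks_per_image = (Cols / tile) * (Rows / tile);
    info.blocks_all_images = info.blocks_per_image * N;

    // Ask the runtime how many blocks of this kernel fit on one MP,
    // then scale by the number of MPs on device 0.
    int per_mp = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&per_mp, loop_over_images,
                                                  tile * tile, 0);
    cudaDeviceProp prop = {};
    cudaGetDeviceProperties(&prop, 0);
    info.resident_capacity = per_mp * prop.multiProcessorCount;
    return info;
}
```

For example, `describe_launch(64, 80, 16, 10)` gives 20 blocks per image and 200 blocks for a grid that spans all 10 images. If `resident_capacity` comes out well above 20, as the 8 * 14 = 112 figure quoted above suggests, a 20-block launch leaves much of the card idle, and spreading the images across the grid may be worth trying.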