So I have a Quadro 4000 (64 blocks, and 1024 threads per block), and I need to process 10,000 images in the first loop, then 100 other images for each of those in the second loop, and then process each zone/region of interest in each image (to do a correlation), let's say 1,000 at most.
How can I efficiently split these nested loops to run the GPU at full potential?
I was thinking of using shared memory for 3 lines: 1 from the image of the first loop and 2 from the second loop. This way, blocks could map to the second loop (2*64=128 > 100) and threads to the last loop over regions of interest (1024 > 1000, which is already a maximum).
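To make the mapping concrete, here is a rough sketch of what I have in mind, written as plain C++ that emulates one kernel launch on the CPU rather than real CUDA. It assumes the first loop (10,000 images) stays on the host with one launch per image, and that a grid can hold at least 100 blocks, so one block handles one second-loop image and one thread handles one region of interest. All names (`work_item`, `emulate_launch`, the `k...` constants) are made up for illustration, and the shared-memory part is left out.

```cpp
#include <cassert>
#include <vector>

// Sizes taken from the question; the names are hypothetical.
constexpr int kInnerImages = 100;       // second loop: images compared against the current one
constexpr int kMaxRois = 1000;          // third loop: regions of interest per image (maximum)
constexpr int kThreadsPerBlock = 1024;  // threads launched per block

// What a single GPU thread would decide: the block index selects the
// second-loop image, the thread index selects the region of interest.
// Returns a flat work index, or -1 if this thread has nothing to do.
int work_item(int block_idx, int thread_idx) {
    if (block_idx >= kInnerImages || thread_idx >= kMaxRois) {
        return -1;  // surplus threads (1000..1023) or surplus blocks stay idle
    }
    return block_idx * kMaxRois + thread_idx;
}

// CPU emulation of one launch for one first-loop image: walk every
// (block, thread) pair and count how often each work item is visited.
std::vector<int> emulate_launch(int num_blocks) {
    assert(num_blocks > 0);
    std::vector<int> visits(kInnerImages * kMaxRois, 0);
    for (int b = 0; b < num_blocks; ++b) {
        for (int t = 0; t < kThreadsPerBlock; ++t) {
            const int w = work_item(b, t);
            if (w >= 0) {
                ++visits[w];  // on the GPU, the correlation for this ROI would run here
            }
        }
    }
    return visits;
}
```

Calling `emulate_launch(kInnerImages)` should visit each of the 100 * 1000 (image, region) pairs exactly once, with the last 24 threads of every block doing nothing. The host would then repeat that launch for each of the 10,000 first-loop images.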
What do you think? Is it possible? I'm a beginner.
Thank you!