So I have a Quadro 4000 (64 blocks, 1024 threads per block), and I need to process 10,000 images in the first loop, then 100 other images for each of those in the second loop, and then process up to 1000 zones/regions of interest in each image (to do a correlation).
How can I efficiently split these nested loops to run the GPU at its full potential?
I was thinking of using shared memory for 3 lines: 1 from the image of the first loop and 2 from the second loop. This way, the blocks can cover the second loop (2*64=128 > 100) and the threads the last loop over the regions of interest (1024 > 1000, which is already a maximum).
What do you think? Is it possible? I'm a beginner…
for images from 1 to 10000
    for images from 1 to 100
        for regions from 1 to 1000
            do correlation
        end loop for regions
    end loop for second images
end loop for first images
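To make the block/thread mapping described above concrete, here is a minimal CUDA sketch. It is only an illustration under assumptions that are not in the post: each region is taken to be 64 contiguous float samples (`kRegionLen`), the "correlation" is reduced to a zero-lag dot product, and the names `correlateRegions` and `correlateOneFirstImage` are hypothetical. One block handles one image of the second loop and one thread handles one region.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kNumSecond  = 100;   // images in the second loop (from the post)
constexpr int kNumRegions = 1000;  // regions of interest per image (from the post)
constexpr int kRegionLen  = 64;    // ASSUMPTION: samples per region

// One block per second-loop image, one thread per region.
// Each thread computes a zero-lag correlation (dot product) for its region.
__global__ void correlateRegions(const float* first,   // [kNumRegions * kRegionLen]
                                 const float* second,  // [kNumSecond * kNumRegions * kRegionLen]
                                 float* out)           // [kNumSecond * kNumRegions]
{
    int img    = blockIdx.x;
    int region = threadIdx.x;
    if (region >= kNumRegions) return;  // 1024 threads launched, only 1000 regions

    const float* a = first + region * kRegionLen;
    const float* b = second + ((size_t)img * kNumRegions + region) * kRegionLen;

    float acc = 0.0f;
    for (int i = 0; i < kRegionLen; ++i) acc += a[i] * b[i];
    out[img * kNumRegions + region] = acc;
}

// Hypothetical host helper: processes ONE image of the first loop against all
// 100 images of the second loop. Error checking is omitted for brevity.
std::vector<float> correlateOneFirstImage(const std::vector<float>& first,
                                          const std::vector<float>& second)
{
    size_t nOut = (size_t)kNumSecond * kNumRegions;
    float *dFirst, *dSecond, *dOut;
    cudaMalloc(&dFirst,  first.size()  * sizeof(float));
    cudaMalloc(&dSecond, second.size() * sizeof(float));
    cudaMalloc(&dOut,    nOut * sizeof(float));
    cudaMemcpy(dFirst,  first.data(),  first.size()  * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSecond, second.data(), second.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Grid = second loop (100 blocks), block = region loop (1024 threads).
    correlateRegions<<<kNumSecond, 1024>>>(dFirst, dSecond, dOut);

    std::vector<float> out(nOut);
    cudaMemcpy(out.data(), dOut, nOut * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dFirst); cudaFree(dSecond); cudaFree(dOut);
    return out;
}
```

In this sketch the outer loop over the 10,000 first images would stay on the host and call the helper once per image. The helper re-uploads the second set on every call only to stay short; a real program would keep it on the device. The memory access pattern is not coalesced either, so treat it as a starting point for the mapping, not as tuned code.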
Maybe I can start by having the GPU do only the correlation calculation. What is the easiest way to implement the classical iFFT[FFT(image1) * conj(FFT(image2))]?
FFT: there is an NVIDIA FFT library (cuFFT); download it from the CUDA page.
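As a rough illustration of FFT-based correlation with cuFFT, here is a minimal sketch for one pair of images. The function `fftCorrelate` and the kernel `mulConjScale` are hypothetical names, the images are assumed to be row-major float arrays of the same size, and error checking is left out. Note that correlation needs the complex conjugate of the second spectrum, and that cuFFT transforms are unnormalized, so the result is scaled by 1/(h*w).

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

// a[i] = a[i] * conj(b[i]) * scale, one thread per frequency bin.
__global__ void mulConjScale(cufftComplex* a, const cufftComplex* b, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex x = a[i], y = b[i];
        a[i].x = (x.x * y.x + x.y * y.y) * scale;  // real part
        a[i].y = (x.y * y.x - x.x * y.y) * scale;  // imaginary part
    }
}

// Circular cross-correlation of two h x w real images:
// result[k] = sum over m of img1[m + k] * img2[m] (indices wrap around).
std::vector<float> fftCorrelate(const std::vector<float>& img1,
                                const std::vector<float>& img2, int h, int w)
{
    int nReal = h * w;
    int nFreq = h * (w / 2 + 1);  // size of the real-to-complex output

    float *dA, *dB;
    cufftComplex *dFA, *dFB;
    cudaMalloc(&dA, nReal * sizeof(float));
    cudaMalloc(&dB, nReal * sizeof(float));
    cudaMalloc(&dFA, nFreq * sizeof(cufftComplex));
    cudaMalloc(&dFB, nFreq * sizeof(cufftComplex));
    cudaMemcpy(dA, img1.data(), nReal * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, img2.data(), nReal * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle fwd, inv;
    cufftPlan2d(&fwd, h, w, CUFFT_R2C);  // first size is the slowest dimension
    cufftPlan2d(&inv, h, w, CUFFT_C2R);

    cufftExecR2C(fwd, dA, dFA);          // FFT(image1)
    cufftExecR2C(fwd, dB, dFB);          // FFT(image2)
    mulConjScale<<<(nFreq + 255) / 256, 256>>>(dFA, dFB, nFreq, 1.0f / nReal);
    cufftExecC2R(inv, dFA, dA);          // inverse FFT, result back in dA

    std::vector<float> result(nReal);
    cudaMemcpy(result.data(), dA, nReal * sizeof(float), cudaMemcpyDeviceToHost);

    cufftDestroy(fwd); cufftDestroy(inv);
    cudaFree(dA); cudaFree(dB); cudaFree(dFA); cudaFree(dFB);
    return result;
}
```

Build with something like `nvcc corr.cu -lcufft`. For many images of the same size, the plans would normally be created once and reused, since plan creation is not free.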
How to organize your program? You need to understand the GPU model better; typically I use one thread per pixel. Use your main host program to keep track of the loops. Copy all images to the GPU board once.
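A minimal sketch of that organization, under some assumptions: all images have the same size and fit in device memory, and the per-pair work is reduced to a zero-lag correlation score so the example stays short. The names `pixelProduct` and `scoreAllPairs` are made up for illustration. The host keeps the two loops, both image sets are uploaded once, and each kernel launch uses one thread per pixel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per pixel: multiply the two pixels and add the product to the
// score of this image pair. atomicAdd on float needs compute capability 2.0+.
__global__ void pixelProduct(const float* a, const float* b, int w, int h, float* score)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) atomicAdd(score, a[y * w + x] * b[y * w + x]);
}

// Hypothetical driver: firstSet holds nFirst images, secondSet holds nSecond
// images, all w x h, stored back to back. Returns nFirst * nSecond scores.
std::vector<float> scoreAllPairs(const std::vector<float>& firstSet, int nFirst,
                                 const std::vector<float>& secondSet, int nSecond,
                                 int w, int h)
{
    size_t px = (size_t)w * h;
    size_t nScores = (size_t)nFirst * nSecond;
    float *dFirst, *dSecond, *dScores;
    cudaMalloc(&dFirst,  nFirst  * px * sizeof(float));
    cudaMalloc(&dSecond, nSecond * px * sizeof(float));
    cudaMalloc(&dScores, nScores * sizeof(float));

    // Copy all images to the board once, before the loops start.
    cudaMemcpy(dFirst,  firstSet.data(),  nFirst  * px * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSecond, secondSet.data(), nSecond * px * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dScores, 0, nScores * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    // The host program keeps track of the loops; the GPU does the pixel work.
    for (int i = 0; i < nFirst; ++i)
        for (int j = 0; j < nSecond; ++j)
            pixelProduct<<<grid, block>>>(dFirst + i * px, dSecond + j * px,
                                          w, h, dScores + (size_t)i * nSecond + j);

    std::vector<float> scores(nScores);
    cudaMemcpy(scores.data(), dScores, nScores * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dFirst); cudaFree(dSecond); cudaFree(dScores);
    return scores;
}
```

Funnelling every pixel into a single `atomicAdd` target is slow and is used here only to keep the kernel to one line; a proper reduction would replace it in practice.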
So you think the best approach is to keep the loops and only do the FFT calculation on the GPU, with 1 thread per pixel?
I thought the main way to increase performance was to use shared memory (for example, by doing a time correlation line by line)?
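For comparison, the shared-memory, line-by-line idea could look roughly like the sketch below. It rests on assumptions that are not in the thread: lines are 256 floats wide (`kLineLen`), the correlation is circular, and `lineCorrelation` and `correlateLines` are hypothetical names. Each block stages one line of each image in shared memory, then thread t computes the correlation at lag t.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kLineLen = 256;  // ASSUMPTION: pixels per image line

// One block per line, one thread per lag. Both lines are loaded into shared
// memory once, then every thread reads them kLineLen times from there.
__global__ void lineCorrelation(const float* img1, const float* img2, float* out)
{
    __shared__ float a[kLineLen];
    __shared__ float b[kLineLen];

    int line = blockIdx.x;
    int t    = threadIdx.x;
    a[t] = img1[line * kLineLen + t];
    b[t] = img2[line * kLineLen + t];
    __syncthreads();  // both lines must be fully loaded before use

    float acc = 0.0f;
    for (int i = 0; i < kLineLen; ++i)
        acc += a[(i + t) % kLineLen] * b[i];  // circular correlation at lag t
    out[line * kLineLen + t] = acc;
}

// Hypothetical host helper; error checking omitted.
std::vector<float> correlateLines(const std::vector<float>& img1,
                                  const std::vector<float>& img2, int numLines)
{
    size_t bytes = (size_t)numLines * kLineLen * sizeof(float);
    float *d1, *d2, *dOut;
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(d1, img1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, img2.data(), bytes, cudaMemcpyHostToDevice);

    lineCorrelation<<<numLines, kLineLen>>>(d1, d2, dOut);

    std::vector<float> out((size_t)numLines * kLineLen);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d1); cudaFree(d2); cudaFree(dOut);
    return out;
}
```

Shared memory removes the repeated global-memory reads here, but this direct method still costs on the order of kLineLen² multiplications per line, while an FFT-based correlation costs on the order of kLineLen·log(kLineLen). Which one wins depends on the line length and is worth measuring.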