I’m a beginner at CUDA programming, so please forgive any misunderstandings! Thanks.
First, the algorithm I am trying to implement works as follows:
- transfer two images (roughly 10 MP each) from the CPU to the GPU
- from given seed points in the two images, extract two NxN sub-images (say N = 32)
- perform an element-wise multiplication between the two sub-images and store the temporary result in shared memory (shared memory usage = NxNxsizeof(float))
- perform a box filter on the temporary result (once in direction x, once in direction y) -> note that this constrains me to thread blocks of N threads, or at least a multiple of N threads, to keep all threads busy
- perform a correlation, given a formula that is not relevant here
- copy the result back from the GPU to the host
(One would also have to handle border effects.)
My real issue here is that the amount of shared memory is proportional to N^2, which I fear will result in extremely poor occupancy…
How would you approach this kind of problem? Or is this kind of algorithm simply not well suited to CUDA?
Any help would be greatly appreciated. Thanks!