I’m looking to optimize some computer vision code I’m working on. Shared memory seems like a logical fit because I’m doing several neighborhood operations. Could someone explain the merits of explicitly giving each thread block a certain amount of shared memory at launch (declaring the array with the ‘extern’ keyword) versus not using ‘extern’ and declaring fixed-size shared arrays within the kernel?
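To make the comparison concrete, here is a minimal sketch of the two styles as I understand them. The kernel names, the TILE size, and the trivial "multiply by 2" body are placeholders I made up for illustration; they are not from my real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical block size, known at compile time

// Style 1: no 'extern'; the array size is fixed at compile time inside the kernel.
__global__ void scaleStatic(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Style 2: 'extern'; the size in bytes is supplied at launch time
// as the third argument of the <<<...>>> execution configuration.
__global__ void scaleDynamic(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;

    // Static version: no shared-memory size in the launch configuration.
    scaleStatic<<<blocks, TILE>>>(d_in, d_out, n);

    // Dynamic version: request TILE floats of shared memory per block.
    scaleDynamic<<<blocks, TILE, TILE * sizeof(float)>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Both kernels do the same work here; the only difference is where the size of the shared array gets decided.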
Additionally, what is the scope of shared variables, whether they’re declared externally or within the kernel? This matters for deciding whether I have to split my algorithms into several kernels or can use one much larger, well-synchronized kernel.
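For the "one larger kernel" option, this is roughly what I have in mind: a sketch with two stages that share a tile, separated by __syncthreads(). My understanding, which I'd like confirmed, is that the tile is only visible to threads of the same block and only lives as long as that block, so splitting into several kernels would mean writing intermediate results back to global memory. The kernel name, TILE, and the 3-point average are hypothetical, and halo pixels from neighboring blocks are deliberately ignored.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical block size; the launch must use exactly TILE threads

// Hypothetical two-stage kernel: stage 1 fills the tile, stage 2 reads neighbors from it.
__global__ void loadThenBlur(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    // Stage 1: each thread stages one element (0 past the end of the input).
    tile[t] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all threads in the block finish stage 1 before stage 2 reads

    // Stage 2: 3-point average, clamped to the tile edges (no halo handling).
    int l = (t > 0) ? t - 1 : t;
    int r = (t + 1 < (int)blockDim.x) ? t + 1 : t;
    if (i < n) out[i] = (tile[l] + tile[t] + tile[r]) / 3.0f;
}

int main()
{
    const int n = 512;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    loadThenBlur<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```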