Processing hyperspectral datacubes: is shared memory worthwhile?

I am currently dealing with radiometric correction of hyperspectral datacubes (ideally in near real-time). The dimensions of my datacube are 1600x400xN, and the correction data is 1600x400, where N is the number of frames sampled in a given period. All I need to do is subtract the correction data from every sampled frame, N in total.
Currently I am using global memory for both the samples and the correction data, and my kernel operates on global memory directly. I believe this can be improved, so I am simply asking: what can be done better?
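For reference, a minimal sketch of what I mean by "operating on global memory" (the names `samples` and `correction`, the frame-major layout, and the launch configuration are my own assumptions):

```cuda
// Subtract the 1600x400 correction frame from each of the N sample frames.
// All pointers are device memory; layout is assumed frame-major, i.e.
// frame n starts at offset n * width * height.
__global__ void correctFrames(float *samples, const float *correction,
                              int width, int height, int nFrames)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int pixel = y * width + x;
    float c = correction[pixel];  // read correction once per thread, into a register
    for (int n = 0; n < nFrames; ++n)
        samples[n * width * height + pixel] -= c;
}
```

Launched, for example, as `correctFrames<<<dim3((1600+15)/16, (400+15)/16), dim3(16,16)>>>(d_samples, d_correction, 1600, 400, N);`. Note that even this version reads each correction value from global memory only once per thread and keeps it in a register across the N frames.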

(1) I am currently using several streams to overlap host-to-device copies, kernel execution, and device-to-host copies.
(2) Is there a sensible way to use shared memory for this problem, given that the correction data is invariant along the third dimension?
(3) Suppose I want to process the datacube in 16x16x(N=32) pieces using shared memory. How can I guarantee the 16x16 correction tile is copied from global memory to shared memory exactly once, and not 32 times? Can each 16x16x1 slice be processed without bank conflicts? Is there anything I need to specify, or does CUDA/the compiler somehow optimize this automatically?
(4) Is there a rule-of-thumb way to decide when shared memory will yield a speedup over global memory, and similarly to choose the batch size (N in my case)?
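To make question (3) concrete, here is the load-once pattern I have in mind, sketched under my own assumptions about data layout: each 16x16 thread block stages its correction tile in shared memory exactly once, synchronizes, then loops over the N frames.

```cuda
#define TILE 16

// Shared-memory variant: the 16x16 correction tile is loaded once per
// block (one element per thread), not once per frame.
__global__ void correctFramesShared(float *samples, const float *correction,
                                    int width, int height, int nFrames)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool inBounds = (x < width && y < height);

    // Cooperative load: exactly one global read of the tile per block.
    if (inBounds)
        tile[threadIdx.y][threadIdx.x] = correction[y * width + x];
    __syncthreads();  // tile must be complete before anyone reads it

    if (!inBounds) return;

    int pixel = y * width + x;
    for (int n = 0; n < nFrames; ++n)
        samples[n * width * height + pixel] -= tile[threadIdx.y][threadIdx.x];
}
```

Regarding bank conflicts: consecutive threads of a warp read consecutive 32-bit elements of a tile row, which map to distinct banks, so this access pattern should be conflict-free. That said, since each thread here only reuses its *own* correction value, a register (as in the plain global-memory kernel) is arguably sufficient; shared memory mainly pays off when threads within a block need to share data.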


Have you tried searching for or reading anything about shared memory? The vast amount of material written and the questions already answered leave almost no ground uncovered.

It was more a question about how to decide which approach would be more optimal. I ended up using shared memory and 1-D texture memory.
Thanks for your feedback.