I am currently dealing with radiometric correction of hyperspectral datacubes (hopefully in almost real-time). The dimensions of my datacube are 1600x400xN, and the correction data is 1600x400. N is the number of samples taken in a specified period. All I need to do is subtract the correction data from every sampled frame, N frames in total.
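For reference, a minimal global-memory version of that subtraction might look like the sketch below. All names, the memory layout (frames stored contiguously, row-major), and the launch configuration are my assumptions, not a definitive implementation:

```cuda
// Naive sketch: subtract one 1600x400 correction frame from each of N frames.
// Assumes frames are packed one after another in a single device buffer.
__global__ void subtractCorrection(float *frames,           // frameSize * n values
                                   const float *correction, // frameSize values
                                   int frameSize,           // 1600 * 400
                                   int n)                   // number of frames
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < frameSize) {
        float c = correction[idx];       // read the correction value once
        for (int k = 0; k < n; ++k)      // apply it to every frame
            frames[k * frameSize + idx] -= c;
    }
}
```

Note that even in this naive form, letting each thread loop over the N frames means the correction value is read from global memory only once per thread and then kept in a register.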
Currently I am using global memory for both the samples and the correction data, and my kernel operates directly on global memory. However, I believe something better can be done, so I am simply asking: "what can be done better?"
(1) I am currently using several streams to overlap copy, process and copy (back) operations.
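The stream pipeline I have in mind looks roughly like this; buffer names, `NUM_STREAMS`, and the per-frame granularity are placeholders, and the host buffers are assumed to be pinned (allocated with `cudaHostAlloc`), which is required for copies to actually overlap with kernels:

```cuda
// Sketch: round-robin frames across streams to overlap H2D copy,
// kernel, and D2H copy. Names and sizes are illustrative assumptions.
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int k = 0; k < n; ++k) {
    cudaStream_t s = streams[k % NUM_STREAMS];
    float *d = d_frames + (size_t)k * frameSize;   // device slot for frame k
    float *h = h_frames + (size_t)k * frameSize;   // pinned host buffer
    cudaMemcpyAsync(d, h, frameSize * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    subtractOneFrame<<<blocks, threads, 0, s>>>(d, d_correction, frameSize);
    cudaMemcpyAsync(h, d, frameSize * sizeof(float),
                    cudaMemcpyDeviceToHost, s);
}
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamSynchronize(streams[i]);
```

With pageable host memory the async copies silently degrade to synchronous behavior, so pinning the host buffers is the first thing to check when overlap does not materialize in the profiler.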
(2) Is there a sensible way to use shared memory for this problem, given that the correction data is invariant along the third dimension?
(3) Suppose I want to use shared memory for each 16x16x(N=32) piece of the datacube. How can I guarantee that the 16x16 correction tile is copied from global memory to shared memory exactly once, and not 32 times? Is there a way to process each 16x16x1 slice without bank conflicts? Is there anything I need to specify, or does CUDA/the compiler somehow optimize this?
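One way I imagine guaranteeing the single load is to stage the tile before a per-block loop over the frames; the sketch below assumes row-major frames and a 16x16 block, and all names are hypothetical:

```cuda
// Sketch: each 16x16 thread block loads its correction tile into shared
// memory once, then iterates over all N frames. Assumes row-major layout.
#define TILE 16

__global__ void subtractTiled(float *frames, const float *correction,
                              int width, int height, int n)
{
    __shared__ float corr[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool valid = (x < width) && (y < height);

    // Loaded exactly once per block, not once per frame.
    corr[threadIdx.y][threadIdx.x] = valid ? correction[y * width + x] : 0.0f;
    __syncthreads();
    if (!valid) return;

    int frameSize = width * height;
    for (int k = 0; k < n; ++k)
        frames[k * frameSize + y * width + x] -= corr[threadIdx.y][threadIdx.x];
}
```

On the bank-conflict question: in this access pattern, threads with consecutive `threadIdx.x` read consecutive 32-bit words of `corr`, which fall into distinct banks, so the reads should be conflict-free without any special annotation. (Whether shared memory actually wins here over a register, as in the naive kernel, is exactly what I would like to understand.)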
(4) Is there a rule of thumb for deciding when shared memory (rather than global memory) will yield a speedup? Similarly, how should I determine the batch size to process (N, in my case)?