There’s no way to answer this without breaking your problem down to understand what it’s doing and how it might be restructured.
But one immediate observation: if each thread needs 2 KB of shared memory, your approach doesn’t map well to the GPU at all. You want not 8 threads per block (or per SM), but more like 256, or preferably 512, and then hundreds of those blocks.
Alternatively, take one of your threads’ tasks (and its 2 KB of data) and figure out how at least 32 threads could solve that task in parallel. Then each warp of a block can work on what you currently have each thread working on. But this leads back to the “it depends what your problem is” conclusion.
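As a rough illustration of the warp-per-task idea, here is a minimal sketch. All names (`perWarpTask`, `taskData`, `numTasks`) and the placeholder sum are hypothetical, since we don’t know what your per-thread work actually is; the point is only the shape: 32 lanes cooperating on one 2 KB buffer, with 8 such tasks per 256-thread block.

```cuda
// Hypothetical sketch: one warp per task, 2 KB of shared memory per warp.
// The actual per-task computation is unknown; a warp-level sum stands in for it.
#define WARP_SIZE        32
#define WARPS_PER_BLOCK  8     // 8 warps -> 256 threads per block
#define FLOATS_PER_TASK  512   // 512 floats = 2 KB per task

__global__ void perWarpTask(const float *taskData, float *taskResult, int numTasks)
{
    // 8 warps x 2 KB = 16 KB of shared memory per block, which fits on any SM.
    __shared__ float buf[WARPS_PER_BLOCK][FLOATS_PER_TASK];

    int warpInBlock = threadIdx.x / WARP_SIZE;
    int lane        = threadIdx.x % WARP_SIZE;
    int task        = blockIdx.x * WARPS_PER_BLOCK + warpInBlock;
    if (task >= numTasks) return;

    // All 32 lanes cooperate on the load: lane i handles elements i, i+32, ...
    for (int i = lane; i < FLOATS_PER_TASK; i += WARP_SIZE)
        buf[warpInBlock][i] = taskData[task * FLOATS_PER_TASK + i];
    __syncwarp();

    // Placeholder work: a warp-cooperative reduction over the task's 2 KB buffer.
    float partial = 0.0f;
    for (int i = lane; i < FLOATS_PER_TASK; i += WARP_SIZE)
        partial += buf[warpInBlock][i];
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        partial += __shfl_down_sync(0xffffffff, partial, offset);

    if (lane == 0) taskResult[task] = partial;
}
```

Launched as `perWarpTask<<<(numTasks + WARPS_PER_BLOCK - 1) / WARPS_PER_BLOCK, 256>>>(...)`, this keeps hundreds of blocks in flight while preserving your 2 KB-per-task working set, now shared by a warp instead of owned by a single thread.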