Plenty of registers to spare.
Occupancy is above 0.5, so that's fine.
Loads of shared memory left.
If you get a bottleneck on transfers to/from global arrays you might be able to improve that using the shared memory.
If your design allows it, you can increase occupancy by changing your blocks to, say, 16 x 12. That is 192 threads, or 6 warps per block, so 8 resident blocks would fill all 48 warp slots. Occupancy is only one of the things that affect performance, though, so it might not run any faster with 192 threads per block, and might even run slower.
Maximum number of resident blocks per multiprocessor: 8
Maximum number of resident warps per multiprocessor: 48
Maximum number of resident threads per multiprocessor: 1536
So what it is saying is that you could run larger blocks. With your current code each SM will have 8 active blocks of 4 warps each (32 warps in total) running, but the hardware can hold 48 warps. It is worth trying a larger block size if that is easy to do, to see whether your run time drops.
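To make that arithmetic concrete, here is a rough sketch in plain C++ (host side only, no CUDA calls). It only applies the three limits quoted above and assumes a warp size of 32; it ignores register and shared memory usage, which can lower the real figure. The names `SmLimits`, `Residency` and `estimateResidency` are made up for this illustration and are not part of any NVIDIA API.

```cpp
#include <algorithm>
#include <initializer_list>

// Per-multiprocessor limits as quoted in this thread.
struct SmLimits {
    int maxBlocks  = 8;     // resident blocks per multiprocessor
    int maxWarps   = 48;    // resident warps per multiprocessor
    int maxThreads = 1536;  // resident threads per multiprocessor
    int warpSize   = 32;    // assumption: standard warp size
};

struct Residency {
    int blocks;        // blocks resident on one multiprocessor
    int warps;         // warps resident on one multiprocessor
    double occupancy;  // resident warps / maximum resident warps
};

// Hypothetical helper: estimates residency from the block size alone.
// Register and shared memory limits are deliberately NOT modelled here.
Residency estimateResidency(int threadsPerBlock, const SmLimits& lim = SmLimits{}) {
    // A partially filled warp still occupies a whole warp slot, so round up.
    int warpsPerBlock = (threadsPerBlock + lim.warpSize - 1) / lim.warpSize;

    // The number of resident blocks is capped by each limit in turn.
    int byWarps   = lim.maxWarps / warpsPerBlock;
    int byThreads = lim.maxThreads / threadsPerBlock;
    int blocks    = std::min({lim.maxBlocks, byWarps, byThreads});

    int warps = blocks * warpsPerBlock;
    return {blocks, warps, static_cast<double>(warps) / lim.maxWarps};
}
```

Under these assumptions, `estimateResidency(128)` gives 8 blocks and 32 warps (occupancy about 0.67), while `estimateResidency(192)` gives 8 blocks and 48 warps (occupancy 1.0). With 256 threads per block the warp limit takes over: only 6 blocks fit, but that is still 48 warps. Whether the higher occupancy actually helps is something only timing the kernel will tell you.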
Each of the 14 multiprocessors can be assigned more than one block at a time.
“Maximum number of resident blocks per multiprocessor: 8”
A multiprocessor can only actually be issuing instructions for a small number of warps at any instant, but if those have to wait for data the multiprocessor very quickly switches to other resident warps, which may belong to another block. In this way the delay in waiting for data ('latency') is hidden (provided there is a warp that is able to run).