Help analyzing Visual Profiler Output

Hello,

I have run the Visual Profiler on my CUDA code, and I am having a bit of trouble interpreting some of the info coming back from it. For starters, I have a kernel that executes the same code on same-size arrays 3 times. The memcpyDeviceToHost time for the first call is very small, about 7 times shorter than for the next two calls. Why could that be? It's moving the same amount of data each time.

Secondly, I have run the occupancy analysis, but I am not quite sure where to start trying to optimize my kernels.

Here is the output:

Occupancy analysis for kernel ‘cudaLeefilter’ for context ‘Session1 : Device_0 : Context_1’ :
Kernel details : Grid size: 56 x 881, Block size: 96 x 1 x 9
Register Ratio = 0.8125 ( 26624 / 32768 ) [29 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 1 : 8
Active threads per SM = 864 : 1536
Occupancy = 0.5625 ( 27 / 48 )
Achieved occupancy = 0.5625 (on 16 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (49336) is not a multiple of available SMs (16).

Occupancy analysis for kernel ‘calc_span’ for context ‘Session1 : Device_0 : Context_1’ :
Kernel details : Grid size: 167 x 110, Block size: 32 x 32 x 1
Register Ratio = 0.65625 ( 21504 / 32768 ) [21 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 1 : 8
Active threads per SM = 1024 : 1536
Occupancy = 0.666667 ( 32 / 48 )
Achieved occupancy = 0.666667 (on 16 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (18370) is not a multiple of available SMs (16).

Occupancy analysis for kernel ‘calc_covMatrix’ for context ‘Session1 : Device_0 : Context_1’ :
Kernel details : Grid size: 56 x 1168, Block size: 96 x 1 x 9
Register Ratio = 0.40625 ( 13312 / 32768 ) [14 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 1 : 8
Active threads per SM = 864 : 1536
Occupancy = 0.5625 ( 27 / 48 )
Achieved occupancy = 0.5625 (on 16 SMs)
Occupancy limiting factor = Block-Size

So apparently my grid size is off for the first two kernels (going by the warning), but what do the other numbers mean? Which number should I be trying to improve? The last two kernels listed here run very fast, but the first one takes a while: about 2 seconds per run, versus a few milliseconds for the others.
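
For reference, here is my attempt at reproducing some of the profiler's numbers by hand, so it is clear which ones I think I understand. This is only a sketch under my own assumptions: 32 threads per warp, and a limit of 48 warps per SM, which I inferred from the 1536 in "Active threads per SM" and the 48 in the occupancy ratio. It ignores register and shared memory limits entirely, and the function names are just ones I made up.

```cpp
// Assumed hardware limits (inferred from the profiler output, not from docs):
const int kWarpSize        = 32;    // threads per warp
const int kMaxThreadsPerSM = 1536;  // from "Active threads per SM = ... : 1536"
const int kMaxWarpsPerSM   = kMaxThreadsPerSM / kWarpSize;  // 48
const int kNumSMs          = 16;    // from "(on 16 SMs)"

// Threads in one block, given its x, y, z dimensions.
int threadsPerBlock(int x, int y, int z) { return x * y * z; }

// Warps needed for one block, rounding partial warps up.
int warpsPerBlock(int threads) { return (threads + kWarpSize - 1) / kWarpSize; }

// Whole blocks that fit under the per-SM thread limit alone
// (register and shared memory limits are deliberately ignored here).
int blocksPerSMByThreads(int threads) { return kMaxThreadsPerSM / threads; }

// Occupancy as resident warps divided by the assumed per-SM warp limit.
double occupancy(int x, int y, int z) {
    int threads = threadsPerBlock(x, y, z);
    int warps   = blocksPerSMByThreads(threads) * warpsPerBlock(threads);
    return static_cast<double>(warps) / kMaxWarpsPerSM;
}

// Blocks left over when the grid is divided evenly across the SMs;
// nonzero is what seems to trigger the "not a multiple" warning.
int gridRemainder(int gridX, int gridY) { return (gridX * gridY) % kNumSMs; }
```

With these, occupancy(96, 1, 9) gives 27 / 48 = 0.5625 and occupancy(32, 32, 1) gives 32 / 48, matching the output above, since only one block of 864 or 1024 threads fits under 1536. gridRemainder(56, 881) and gridRemainder(167, 110) are nonzero, while gridRemainder(56, 1168) is 0, which would explain why the third kernel has no warning.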