Optimization at a beginner level

My project contains two kernels and two host callback functions. I am running the kernels on a GTX 930MX (integrated).

I have some questions regarding the profiler output.

  1. In the kernel latency section, there is a pie chart of stall reasons. What does the “other” category in the chart indicate? Here is my stall reasons pie chart:


  2. My execution configuration is the following:
Threads per block = 512
Blocks per grid   = 33
Total number of SMs on my GPU = 3
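
To make the arithmetic behind this configuration explicit, here is a rough Python sketch. It is only a back-of-the-envelope calculation, not profiler output. It assumes a warp size of 32 and a limit of 64 resident warps per SM, and it assumes that registers and shared memory are not the tighter limit, which I have not verified for my kernels. The variable names are my own.

```python
# Back-of-the-envelope occupancy arithmetic for the launch configuration above.
WARP_SIZE = 32             # threads per warp (assumed)
threads_per_block = 512
blocks_per_grid = 33
num_sms = 3
max_warps_per_sm = 64      # assumed device limit on resident warps per SM

# Each block of 512 threads is split into 16 warps.
warps_per_block = threads_per_block // WARP_SIZE

# Upper bound on resident blocks per SM from the warp limit alone
# (assumption: registers and shared memory do not lower this bound).
max_blocks_per_sm = max_warps_per_sm // warps_per_block
theoretical_warps_per_sm = max_blocks_per_sm * warps_per_block

# How many blocks fit on the whole GPU at once, and what is left over.
blocks_per_wave = num_sms * max_blocks_per_sm
full_waves, tail_blocks = divmod(blocks_per_grid, blocks_per_wave)

print("warps per block:", warps_per_block)
print("max resident blocks per SM (warp limit only):", max_blocks_per_sm)
print("theoretical active warps per SM:", theoretical_warps_per_sm)
print("full waves:", full_waves, "blocks in the partial last wave:", tail_blocks)
```

Under these assumptions the grid does not divide evenly into full waves, so some SMs would hold fewer than the maximum number of blocks near the end of the kernel. Whether that accounts for the achieved value is part of what I am asking.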

The profiler reports the following:
The achieved active warps value is 54 in my case, and the device limit is 64. As far as I understand it, the value is 54 because making one more block active would exceed the warp limit of 64: each block has 512 / warp size = 16 warps, and 54 + 16 > 64. Here is the screenshot:


  3. Multiprocessor utilization shows that my SMs are about 90% utilized. Is that a good sign that I am close to the limit of my GPU's capability?


I am open to any suggestions.

P.S.: Please excuse my English.