Not all blocks executing

I have a kernel call with 128 blocks, but as I can see while debugging and from the final result, some of the blocks are not running at all.
The reason I think they are not running is that I can't catch a few of the blocks while debugging, and I tried a few ways to do it:

  1. conditional breakpoints
  2. a condition in the code on blockIdx, with a breakpoint inside the condition

It is important to say that I'm always checking cudaError_t and it is OK (no errors from the CUDA API or from my kernel).
Any ideas what could be the reason for this kind of behaviour, and how can it be resolved?

I'm using a GTX 460.


To be sure how many blocks are being executed, you can run the CUDA profiler or computeprof.
These profilers give you a lot of information, including the number of blocks executed.
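
If you want a quick check without the profiler or the debugger, one option is to have every block write a flag to global memory and count the flags on the host. The following is only a minimal sketch, not your actual kernel: the kernel name markBlocks, the array ran, and the 64 threads per block are made-up placeholders, and it assumes a toolkit that has cudaDeviceSynchronize (CUDA 4.0 or later). It also checks for errors both right after the launch and after the synchronize, since a kernel failure is only reported at the latter point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: thread 0 of each block
// records that its block was executed.
__global__ void markBlocks(int *ran)
{
    if (threadIdx.x == 0)
        ran[blockIdx.x] = 1;
}

int main()
{
    const int numBlocks = 128;        // grid size from the question
    const int threadsPerBlock = 64;   // assumed value, not from the question
    int h_ran[numBlocks];
    int *d_ran = NULL;

    cudaMalloc((void **)&d_ran, numBlocks * sizeof(int));
    cudaMemset(d_ran, 0, numBlocks * sizeof(int));

    markBlocks<<<numBlocks, threadsPerBlock>>>(d_ran);

    // Launch-time errors (bad configuration, etc.) show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors raised while the kernel runs show up only after a sync.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_ran, d_ran, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    // Count the blocks that set their flag and list any that did not.
    int count = 0;
    for (int i = 0; i < numBlocks; ++i) {
        if (h_ran[i])
            ++count;
        else
            printf("block %d did not run\n", i);
    }
    printf("%d of %d blocks ran\n", count, numBlocks);

    cudaFree(d_ran);
    return 0;
}
```

You could add the same kind of flag write to your own kernel. If the count matches the grid size there, the blocks are running, and the problem is more likely in how the debugger picks the block to stop in than in the launch itself.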