Strange CUDA behaviour
CUDA behaves strangely for different block/thread configurations.

I have an ODE (ordinary differential equation) solver that I have system-parallelized, i.e. I run the same solver on every thread, but with different initial values on each thread.
I need a global sync at each step of the ODE solve (because I need to share data between solvers on different blocks), so I call the kernel in a loop from the host.
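In outline, the driver loop looks like this. This is a simplified sketch, not my actual code: solveStepKernel, d_state, nSystems and the trivial update inside the kernel are placeholders for the real solver.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: one thread per ODE system; the real solver body
// is much more involved and uses shared memory per thread.
__global__ void solveStepKernel(float *state, int nSystems, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nSystems)
        state[i] += dt * state[i];   // stand-in for the actual update
}

// Host-side loop: one launch per ODE step, so the launch boundary is the
// global synchronization point between blocks.
void solve(float *d_state, int nSystems, int nSteps, float dt,
           int numBlocks, int threadsPerBlock)
{
    for (int step = 0; step < nSteps; ++step) {
        solveStepKernel<<<numBlocks, threadsPerBlock>>>(d_state, nSystems, dt);
        cudaDeviceSynchronize();   // also surfaces launch/runtime errors
    }
}
```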
Now, the problem is that when I run the kernel with:

  1. Fewer than 16 threads per block, it runs perfectly (although slower than the CPU) and produces the correct results with any number of blocks.
  2. Exactly 16 threads per block, the program no longer works and I start getting NaN as the results.
  3. 32 or more threads per block, I can't run it at all because of the amount of shared memory I am using (see the sketch after this list).

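To put numbers on the shared-memory constraint: the per-block usage grows linearly with the block size, roughly as in the check below. The per-thread workspace size is a placeholder here, not my actual figure.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-thread workspace size; my real solver keeps the ODE
// state plus temporaries for each thread in shared memory.
const int STATE_DOUBLES_PER_THREAD = 64;

// Compare how the per-block shared memory grows with block size against
// what the device actually provides.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    for (int threads = 8; threads <= 64; threads *= 2) {
        size_t needed = (size_t)threads * STATE_DOUBLES_PER_THREAD * sizeof(double);
        printf("%2d threads/block -> %6zu bytes shared (limit %zu) %s\n",
               threads, needed, prop.sharedMemPerBlock,
               needed > prop.sharedMemPerBlock ? "TOO MUCH" : "ok");
    }
    return 0;
}
```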
Could the problem I mention in point 2 be related to shared memory, or to something else?
It would be great if someone could point me in the right direction so that I can get a clue as to why this is happening.

Thanks!