Floating-point results vary between warps

I am doing numerical integration on a set of neurons, with one neuron mapped to one thread. The spike occurrence times are numerically integrated in a CUDA kernel with only a single block. I expect identical spike trains for, say, 100 neurons, since all of them run under the same conditions, but I observe that the first 32 neurons produce identical spike trains while the next 32 have spike times that differ from the first group, and so on in groups of 32.
I use the __expf function and `float` variables.
I found that the floating-point values calculated differ between warps. How can I avoid this problem?
That is, I want all threads in a block to produce the same floating-point results.
Has anyone seen a similar problem?
I use a GeForce GT 440.

I assume you meant thread block size, not warp size; the warp size is fixed at 32. Given that 32 happens to be the warp size, I hazard a guess that there is an issue with your code. Floating-point arithmetic is deterministic: the same sequence of operations on the same inputs produces bit-identical results regardless of which warp executes it, so per-warp differences usually mean different threads are actually reading different data. The first thing I would do is check for out-of-bounds accesses and race conditions; cuda-memcheck can help you find both.
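To illustrate, here is a minimal sketch of the bounds-guard pattern whose absence is a classic cause of warp-grouped differences. The kernel name `integrate` and its parameters are hypothetical, not taken from the original post:

```cuda
#include <cstdio>

// Hypothetical per-neuron integration step; names and the update
// formula are illustrative only.
__global__ void integrate(const float *v_in, float *v_out, int n_neurons)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Without this guard, threads beyond n_neurons read and write out
    // of bounds, corrupting neighboring neurons' state in ways that
    // can show up as warp-sized groups of diverging results.
    if (idx < n_neurons)
        v_out[idx] = v_in[idx] + __expf(-v_in[idx]);
}
```

You can then run the application under `cuda-memcheck ./app` to catch out-of-bounds accesses, and `cuda-memcheck --tool racecheck ./app` to catch shared-memory race conditions.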