Device occupancy and correctness of computation

If I occupy a K40 device in such a way that only one block can run at a time (say, by using 44 KB of shared memory per block), my kernel computes correctly. If I reduce the amount of shared memory per block (say, to 4 KB) so that more blocks can run concurrently, the results seem to be wrong.
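A minimal sketch of this kind of experiment, assuming the shared memory is requested dynamically at launch. The kernel `blockSum`, the grid dimensions, and the input data are hypothetical stand-ins rather than the actual kernel; only the two shared memory sizes come from the description above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: each block sums its own slice of `in`.
// Only the first blockDim.x floats of the dynamic shared buffer are used;
// the rest is requested purely to limit how many blocks fit on one SM.
__global__ void blockSum(const float *in, float *out, int n)
{
    extern __shared__ float scratch[];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();
    // Tree reduction; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = scratch[0];
}

int main()
{
    const int threads = 256, blocks = 64, n = threads * blocks;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // Shared memory per block: 44 KB (low occupancy) and 4 KB (higher occupancy).
    const size_t sizes[2] = {44 * 1024, 4 * 1024};
    for (int k = 0; k < 2; ++k) {
        size_t smem = sizes[k];

        // Ask the runtime how many blocks can be resident per SM at this size.
        int resident = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident, blockSum,
                                                      threads, smem);

        blockSum<<<blocks, threads, smem>>>(in, out, n);
        cudaError_t err = cudaGetLastError();            // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Compare every per-block result against the known answer.
        int bad = 0;
        for (int b = 0; b < blocks; ++b)
            if (out[b] != (float)threads) ++bad;
        printf("smem=%zu bytes, resident blocks per SM=%d, wrong blocks=%d\n",
               smem, resident, bad);
    }

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Checking the launch and synchronization status and comparing both configurations against a known reference helps narrow down whether the wrong results are tied to concurrency or to something else that changes with the shared memory size.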

From an algorithmic point of view, my kernel should work irrespective of the number of blocks running concurrently.

Any clues as to what might be going wrong?