Loss of result accuracy when increasing blocks per multiprocessor

I developed a code that produces accurate results (superimposable on the CPU reference) whenever one block per SM runs concurrently, while accuracy is progressively lost when more than one block per SM runs concurrently, up to the maximum allowed.
A check with nvprof has shown that up to 8 blocks can run concurrently per SM (which is acceptable for my GPU card as well), the limit being set by register usage, which I am trying to reduce. The CUDA Occupancy Calculator spreadsheet provided by NVIDIA gives the same values. My current card has 22 SMs, and I have another with 80 SMs; the same problem occurs on both.
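
For reference, the same per-SM block count can also be queried at runtime instead of through the spreadsheet. A minimal sketch, assuming a simplified one-argument kernelName and a block size of 256 (both are placeholders for my real kernel):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real kernel; only the signature matters here.
__global__ void kernelName(double *data) { }

int main() {
    const int nThreads = 256;                        // assumed block size
    const size_t smem = nThreads * sizeof(double);   // dynamic shared memory per block
    int blocksPerSM = 0;
    // Runtime equivalent of the NVIDIA occupancy spreadsheet:
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelName,
                                                  nThreads, smem);
    printf("Concurrent blocks per SM: %d\n", blocksPerSM);
    return 0;
}
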
I am also taking advantage of dynamically allocated shared memory, declared as

extern __shared__ double shared[];
double *share = &shared[0];

in the __global__ function, with the size passed as the third argument inside the triple angle brackets of the kernel call:

kernelName<<<nBlocks, nThreads, nThreads*sizeof(double)>>>(data1,…,dataN);
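
For context, here is a minimal self-contained version of this pattern. The per-block tree reduction is only an illustrative body; my real kernel and its data1,…,dataN arguments are more involved:

#include <cuda_runtime.h>

__global__ void kernelName(const double *in, double *blockSums) {
    extern __shared__ double shared[];
    double *share = &shared[0];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    share[tid] = in[gid];
    __syncthreads();                 // all writes done before any reads

    // Illustrative body: per-block tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) share[tid] += share[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = share[0];
}

int main() {
    const int nThreads = 256;
    const int nBlocks = 22 * 8;      // e.g. 8 resident blocks on each of 22 SMs
    const int n = nThreads * nBlocks;

    double *d_in, *d_sums;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_sums, nBlocks * sizeof(double));
    cudaMemset(d_in, 0, n * sizeof(double));

    kernelName<<<nBlocks, nThreads, nThreads * sizeof(double)>>>(d_in, d_sums);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_sums);
    return 0;
}
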

Also, if I force the GPU to run only one block per SM concurrently (for example, by increasing the dynamic shared memory request), the solution is correct even when nBlocks is far larger than the number of SMs.
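
Concretely, the way I force one resident block per SM is by padding the dynamic shared memory request. A sketch, reusing the simplified kernelName, nBlocks, nThreads, d_in and d_sums from above (the opt-in call is only needed when the padded size exceeds the default 48 KB per-block limit, and is only honored on newer architectures):

int dev = 0, smemPerSM = 0;
cudaGetDevice(&dev);
cudaDeviceGetAttribute(&smemPerSM,
                       cudaDevAttrMaxSharedMemoryPerMultiprocessor, dev);

// Just over half the per-SM shared memory: two blocks can no longer
// co-reside on one SM, so at most one block runs per SM at a time.
size_t padded = (size_t)smemPerSM / 2 + 1;
cudaFuncSetAttribute(kernelName, cudaFuncAttributeMaxDynamicSharedMemorySize,
                     (int)padded);

kernelName<<<nBlocks, nThreads, padded>>>(d_in, d_sums);
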
This is my first complex CUDA code, and I'm quite lost as to where the error might be. May I ask for some suggestions, please? What do you think the error might be?