Floating point operation

Hi. I wrote a matrix multiplication kernel for A*A^T (the conjugate transpose, since the values are complex). When I compared the actual result to the expected result, I saw that some values along the diagonal are incorrect.

RTX A4000, CUDA 11.4, Ubuntu 10.04.6

The matrix is 5x4 (5 rows, 4 columns). The values are cuFloatComplex with magnitudes of about 10^15.
The kernel is launched with one block (1,1) of 5x5 threads.

The numbers below are from the thread with threadIdx.x = 2, threadIdx.y = 2.

For this example I multiply only the first element of the row (k = 0), so for this thread both lines in the loop compute z * conj(z) for the same element:

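acc_sum1.x = 0;
acc_sum1.y = 0;
acc_sum2.x = 0;
acc_sum2.y = 0;
for (int k = 0; k < 1; k++)
{
    // First line: element indexed through threadIdx (row * 4 columns + k), which is index 8 for this thread
    acc_sum1 = cuCaddf(acc_sum1, cuCmulf(matrix[k + threadIdx.y * 4], cuConjf(matrix[k + threadIdx.x * 4])));
    // Second line: the same element with a hard-coded index
    acc_sum2 = cuCaddf(acc_sum2, cuCmulf(matrix[8], cuConjf(matrix[8])));
}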

The value of the first element of row 2 (matrix[8]) is
Real -3647191470047232, Imag 1640025074696192

The result of the first line (acc_sum1), as real, imaginary:
15991687666823789327821742014464,230493328208896511180800

The result of the second line (acc_sum2), which is the correct result:
15991687666823789327821742014464, 0

Can someone tell me why the results are different?