Optimum thread count

Hello everyone

If I launch the kernel as follows, I get the correct result matrix:

example<<<1, 256>>>(d_a, d_b, d_c);

If I launch it like this, the result matrix is wrong:

example<<<10, 256>>>(d_a, d_b, d_c);

Where could the error be?

Can you post your implementation of example please, along with the correct and incorrect matrices?

My shot in the dark is that you are running into a race condition once more than one block is launched.
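Since the kernel has not been posted, here is only a rough sketch of one common cause, under the assumption that example is an element-wise operation (here c = a + b) and that each thread is meant to handle one element. If the index is computed from threadIdx.x alone, a launch with 1 block works, but with 10 blocks every block writes the same 256 elements and the rest are never touched. The extra parameter n and the helper count_mismatches are hypothetical names added for illustration; they are not from your code.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the unposted kernel: c[i] = a[i] + b[i].
// The parameter n is an assumption; the original launch passes only three pointers.
__global__ void example(const float *a, const float *b, float *c, int n)
{
    // Global index: blockIdx.x must be included, otherwise every block
    // works on the same first blockDim.x elements.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the surplus threads when n is not a multiple of the block size.
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs the kernel with the given launch configuration
// and returns the number of wrong elements, or -1 on a CUDA error.
int count_mismatches(int blocks, int threads, int n)
{
    std::vector<float> h_a(n), h_b(n), h_c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = static_cast<float>(2 * i);
    }

    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc(&d_c, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1; }

    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_c, 0, bytes);

    example<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Check both the launch itself and the kernel execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) return -1;

    // Compare against the host result.
    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != h_a[i] + h_b[i]) ++wrong;
    }
    return wrong;
}
```

With this indexing, count_mismatches(10, 256, 2560) should report no wrong elements, the same as the single-block launch. If your kernel does something else, for example several threads accumulating into the same output element, the fix will look different, so the actual code is still needed to say more.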