OK, then there must be some other reason for my problem. If I run the function as above, parts of the matrix I am calculating are NaN. If I comment out one of the kernels, or set N to a small number, the results are fine. I do not use shared memory, and there is not a single division in any of the kernels, so I am at my wits' end as to what might even cause the NaN values.
It might well be that the NaN values are coming from uninitialised memory, either because your code is reading out of bounds, or because the code isn't actually running to completion, leaving some of the output memory untouched.
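One way to tell those two cases apart is to check every runtime call and pre-fill the output buffer with a known pattern before launching. The following is only a rough sketch, not your actual code: `scale_kernel`, `run_and_count_nans` and the `CUDA_CHECK` macro are made-up names standing in for your own kernels and host function, and it assumes `n > 0`. Filling the output with `0xFF` bytes makes every untouched float read back as NaN, so any NaN that survives marks an element the kernel never wrote.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical stand-in for one of the real kernels.
__global__ void scale_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // bounds guard: surplus threads in the last block do nothing
        out[i] = 2.0f * in[i];
    }
}

// Runs the kernel on n elements (n > 0 assumed) and returns how many
// output elements are still NaN afterwards.
int run_and_count_nans(int n)
{
    std::vector<float> h_in(n, 1.0f), h_out(n);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;

    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));

    // All-ones bytes form a NaN bit pattern for every float in the buffer.
    CUDA_CHECK(cudaMemset(d_out, 0xFF, bytes));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while running

    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));

    int nans = 0;
    for (float v : h_out) {
        if (std::isnan(v)) ++nans;
    }

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return nans;
}
```

If a check like this aborts with an error message, the kernel did not run to completion. If it finishes but reports leftover NaNs, some elements were never written, which points at the indexing or the grid size rather than the arithmetic. Running the program under `compute-sanitizer` (or `cuda-memcheck` on older toolkits) should also flag out-of-bounds reads.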
The initial matrix contained some very small floating-point numbers. I'm not entirely sure how this might have caused the problems; anyway, I scaled them up a bit and the program works, so I am happy for now :-) Thanks for the feedback!