After scaling up the problem I am working on, it looks as though something may be wrong with the two C870s I am using.
With a 70k-particle problem the results are:
Quadro FX 1700 - perfect
C870 #1 - NaNs all over the place
C870 #2 - NaNs all over the place
So far I have been developing my algorithm on small problems only, and only on C870 #1. When I moved to multiple GPUs with a bigger problem, GPU #2 hung on a cudaMemcpy from device to host. When I then simply initialized a large amount of data on each of the GPUs as a test, GPU #2 turned out to be mixing up half-warps of data. IT services seemed to have fixed that last night, but now I am getting the behaviour above.
If this is a hardware fault, what could be causing it?
Because the Quadro runs fine, I suspect a problem with the PCI Express connections between the host and the devices, with the GPUs themselves, or with both.
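
A minimal sketch of the kind of per-device fill-and-read-back test mentioned above. This is not the original test code: the kernel name fillPattern, the buffer size, the launch configuration and the idea of passing the device index on the command line are all assumptions made for illustration. Each word is set to its own index on the device, copied back, and compared on the host, so a displaced word shows which index it came from.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel: store each element's own index, using a
// grid-stride loop so the grid size stays well under hardware limits.
__global__ void fillPattern(unsigned int *data, unsigned int n)
{
    unsigned int stride = gridDim.x * blockDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = i;
}

// Abort with a readable message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv)
{
    // Device index from the command line (assumed convention), default 0.
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    const unsigned int n = 1u << 22;               // 4M words = 16 MB, arbitrary
    const size_t bytes = n * sizeof(unsigned int);

    check(cudaSetDevice(dev), "cudaSetDevice");
    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, dev), "cudaGetDeviceProperties");
    printf("Testing device %d (%s)\n", dev, prop.name);

    unsigned int *host = (unsigned int *)malloc(bytes);
    if (!host) {
        fprintf(stderr, "host malloc failed\n");
        return EXIT_FAILURE;
    }
    memset(host, 0xFF, bytes);  // poison the buffer so stale data is visible

    unsigned int *d = NULL;
    check(cudaMalloc((void **)&d, bytes), "cudaMalloc");
    fillPattern<<<1024, 256>>>(d, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(host, d, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");

    // Compare on the host and print the first few mismatches.
    unsigned int bad = 0;
    for (unsigned int i = 0; i < n; ++i) {
        if (host[i] != i) {
            if (bad < 8)
                printf("  index %u: expected %u, got %u\n", i, i, host[i]);
            ++bad;
        }
    }
    printf("%u of %u words wrong\n", bad, n);

    cudaFree(d);
    free(host);
    return bad ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

Running the binary once per device index keeps each run to a single device per host thread. If mismatches turn up in runs of 16 consecutive words, that matches the half-warp size on this generation of hardware.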