Is the fault hardware or software?

In expanding the size of the problem I am working on, it appears that something may be wrong with the two C870s I am using.

With a 70k-particle problem the results are:
Quadro FX 1700 - perfect
C870 #1 - NaNs all over the place
C870 #2 - NaNs all over the place

I have been developing my algorithm on small problems only, and only on C870 #1. When I moved to multiple GPUs with a bigger problem, GPU #2 hung on a cudaMemcpy from device to host. When I then simply initialized a lot of data on each of the GPUs as a test, GPU #2 was found to be mixing up half-warps of data. IT services seemed to have fixed that last night, but now I am getting the behaviour above.

If the problem is hardware, what could it be?

I am thinking that because the Quadro runs OK, there is a problem with PCI Express or the connections between host and devices, or with the GPUs, or both.

Personally, I don’t think it’s a problem with the C870s. There are many reasons why a program can work on one GPU but not on another. I’m assuming all the cards are in the same system.

First, are you using CUDA_SAFE_CALL, and does it report any errors? A cudaMalloc or cudaMemcpy may be failing for some reason, or the kernel launch itself may be generating an error.
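If you are not sure the SDK macro is active in your build, here is a minimal sketch of checking every call yourself. CHECK_CUDA and the scale kernel are hypothetical names I made up for illustration; they are not part of CUDA or cutil, and the kernel just stands in for your real one. It assumes a toolkit that has cudaDeviceSynchronize (CUDA 4.0 or later; older toolkits used cudaThreadSynchronize instead).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA or cutil): abort with a message
// if a runtime API call returns anything other than cudaSuccess.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,       \
                    __LINE__, #call, cudaGetErrorString(err_));       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real particle update.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 70000;  // particle count mentioned in the question
    float *h = (float *)malloc(n * sizeof(float));
    if (!h) return EXIT_FAILURE;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d, n * sizeof(float)));
    CHECK_CUDA(cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice));

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CHECK_CUDA(cudaGetLastError());       // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("h[0] = %f\n", h[0]);

    CHECK_CUDA(cudaFree(d));
    free(h);
    return 0;
}
```

The two checks after the launch matter because a kernel launch does not return an error code. Without them a failed launch goes unnoticed and the following cudaMemcpy just copies back whatever was already in device memory.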

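Second, are you using any atomics? The FX 1700 is compute capability 1.1, while the C870 is 1.0 and doesn’t support atomic operations. Also, merely compiling with -arch sm_11 will make the kernel launch fail on the C870, even if the code contains no atomics. In that case the results are never computed and you read back whatever was in memory, which could explain the NaNs.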
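To see what each card actually reports, something like the following sketch should do. It only uses cudaGetDeviceCount and cudaGetDeviceProperties from the runtime API; the "global atomics" column is my own derived flag, based on the rule that global-memory atomics need compute capability 1.1 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n", dev,
                    cudaGetErrorString(err));
            return 1;
        }

        // Derived flag: global-memory atomics need compute capability >= 1.1.
        bool atomics = (prop.major > 1) ||
                       (prop.major == 1 && prop.minor >= 1);

        printf("device %d: %s, compute %d.%d, global atomics: %s\n",
               dev, prop.name, prop.major, prop.minor,
               atomics ? "yes" : "no");
    }
    return 0;
}
```

If the C870s show up as 1.0, build for the lowest capability you need to run on, and check the launch for errors on each device.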