What are the main differences between programming two C870s and one S1070?

I thought that the S1070 was simply four C1060s wired together, but the S1070 system may be slightly different to that.

I have been programming two C870s in parallel, using pthreads to manage each device. Does the S1070 appear as one device or four? That is, if it exposes four distinct devices, would I need four threads to use all of them in parallel, or just one?

And are there any other differences in programming the two systems?

The S1070 appears as 4 separate CUDA devices, so you’ll need one host thread to control each one.
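For illustration, here is a minimal sketch of the one-thread-per-device pattern, extending the pthreads approach from the question to however many devices `cudaGetDeviceCount` reports. The `scale` kernel, the `WorkerArgs` struct, `run_on_all_devices`, and the `MAX_DEVICES` cap are placeholder names made up for this example, not part of any library. It uses `cudaDeviceSynchronize`, which assumes CUDA 4.0 or later; older toolkits used `cudaThreadSynchronize` instead.

```cuda
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

#define MAX_DEVICES 16  // arbitrary cap for this sketch

// Placeholder kernel standing in for real per-device work.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

struct WorkerArgs {
    int device;  // CUDA device index this thread drives
    int ok;      // set to 1 if every CUDA call succeeded
};

static void *worker(void *p) {
    WorkerArgs *args = static_cast<WorkerArgs *>(p);
    const int n = 1 << 20;
    float *d = NULL;
    args->ok = 0;
    // Bind this host thread to its device before any other CUDA call.
    if (cudaSetDevice(args->device) != cudaSuccess) return NULL;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return NULL;
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaError_t launch = cudaGetLastError();
    cudaError_t sync = cudaDeviceSynchronize();
    cudaFree(d);
    args->ok = (launch == cudaSuccess && sync == cudaSuccess);
    return NULL;
}

// Starts one host thread per visible device; returns how many succeeded.
int run_on_all_devices(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    if (count > MAX_DEVICES) count = MAX_DEVICES;
    pthread_t threads[MAX_DEVICES];
    WorkerArgs args[MAX_DEVICES];
    bool started[MAX_DEVICES];
    for (int i = 0; i < count; ++i) {
        args[i].device = i;
        args[i].ok = 0;
        started[i] = pthread_create(&threads[i], NULL, worker, &args[i]) == 0;
    }
    int succeeded = 0;
    for (int i = 0; i < count; ++i) {
        if (started[i]) pthread_join(threads[i], NULL);
        succeeded += args[i].ok;
    }
    return succeeded;
}

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d of %d device(s) completed\n", run_on_all_devices(), count);
    return 0;
}
```

Built with something like `nvcc -o multi multi.cu -lpthread`, the same binary should drive two C870s or the four devices of an S1070 without changes, since the thread count follows the device count.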

You also have to remember that the S1070's four cards share two PCIe cables to the host (my tests suggest that the cards contend for this bandwidth, rather than each getting a fixed share of the bus). If you’re bandwidth-bound, this might be a problem.
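One way to check the contention on your own system is to time host-to-device copies on each device alone and then on all devices at once. The sketch below is only one possible way to do that, not the test referred to above: the buffer size, repeat count, `CopyArgs`, `copy_worker`, and `measure` are all assumptions chosen for this example, and it relies on POSIX barriers and C++11 `std::chrono`.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <cuda_runtime.h>

#define MAX_DEVICES 16                     // arbitrary cap for this sketch
static const size_t kBytes = 64u << 20;    // 64 MiB per copy (assumed size)
static const int kRepeats = 10;            // copies per measurement (assumed)

struct CopyArgs {
    int device;                  // CUDA device index this thread drives
    pthread_barrier_t *barrier;  // lets all threads start copying together
    double gb_per_s;             // result; negative if setup failed
};

static void *copy_worker(void *p) {
    CopyArgs *a = static_cast<CopyArgs *>(p);
    void *host = NULL, *dev = NULL;
    a->gb_per_s = -1.0;
    bool ready = cudaSetDevice(a->device) == cudaSuccess &&
                 cudaMallocHost(&host, kBytes) == cudaSuccess &&  // pinned
                 cudaMalloc(&dev, kBytes) == cudaSuccess;
    // Every thread reaches the barrier, even after a failure, to avoid deadlock.
    pthread_barrier_wait(a->barrier);
    if (ready) {
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kRepeats; ++i)
            cudaMemcpy(dev, host, kBytes, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        if (dt.count() > 0.0)
            a->gb_per_s = (double)kRepeats * kBytes / dt.count() / 1e9;
    }
    if (dev) cudaFree(dev);
    if (host) cudaFreeHost(host);
    return NULL;
}

// Copies on devices [first, first + n) simultaneously; fills gbps[0..n).
static void measure(int first, int n, double *gbps) {
    pthread_t threads[MAX_DEVICES];
    CopyArgs args[MAX_DEVICES];
    pthread_barrier_t barrier;
    pthread_barrier_init(&barrier, NULL, n);
    for (int i = 0; i < n; ++i) {
        args[i].device = first + i;
        args[i].barrier = &barrier;
        args[i].gb_per_s = -1.0;
        if (pthread_create(&threads[i], NULL, copy_worker, &args[i]) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            exit(1);  // a missing thread would leave the barrier waiting
        }
    }
    for (int i = 0; i < n; ++i) {
        pthread_join(threads[i], NULL);
        gbps[i] = args[i].gb_per_s;
    }
    pthread_barrier_destroy(&barrier);
}

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 1) return 1;
    if (count > MAX_DEVICES) count = MAX_DEVICES;
    double solo[MAX_DEVICES], shared[MAX_DEVICES];
    for (int d = 0; d < count; ++d) measure(d, 1, &solo[d]);  // one at a time
    measure(0, count, shared);                                // all at once
    for (int d = 0; d < count; ++d)
        printf("device %d: alone %.2f GB/s, concurrent %.2f GB/s\n",
               d, solo[d], shared[d]);
    return 0;
}
```

If the concurrent figures fall well below the solo figures for devices that share a cable, the cards are competing for the link, and a bandwidth-bound application would likely benefit from overlapping transfers with computation or reducing host traffic.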