I have developed some mixed CUDA/MPI code which I hope to run on a cluster of S1070s, but I have hit a problem that could be in either MPI or CUDA, and I don't know which!
On our S1070, devices 0, 2, 3, and 4 are C1060s. When I try to run my code across those four devices, device 3 reports 0% free memory when queried with cuMemGetInfo, then allocates a few variables but quickly runs out of device memory. The other three devices all report 99% free memory before allocation, allocate every variable without error, and report 42% free memory afterwards.
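For reference, this is roughly the kind of per-device check I mean. It's a simplified sketch, not my actual code: it uses the runtime API call cudaMemGetInfo rather than the driver call cuMemGetInfo I mentioned, but both report the same free/total figures.

```cuda
// Sketch: report free/total memory on every visible device before allocating.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // cudaMemGetInfo reports for the currently selected device.
        cudaSetDevice(dev);
        size_t free_b = 0, total_b = 0;
        cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("device %d (%s): %.1f%% free (%zu / %zu bytes)\n",
               dev, prop.name, 100.0 * (double)free_b / total_b,
               free_b, total_b);
    }
    return 0;
}
```

On our box this prints ~99% free for three of the C1060s and 0% free for device 3 before any allocation has happened.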
I also have access to a cluster of S1070s at a government research lab, and I have been trying to get the same code working there, with the goal of eventually running it across the whole cluster. Here's the strange thing: exactly the same failure occurs on their S1070s. On their machines the C1060s are devices 0, 1, 2, and 3 (on ours they are 0, 2, 3, and 4), yet the memory error always occurs on device 3 in both cases: device 3 on our S1070 and device 3 on theirs.
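In case the device-selection logic matters, something like the following is what I mean by running over four devices: one MPI rank per C1060, with the usable device numbers hardcoded per site. This is a simplified sketch, not my actual code, and the `usable_devs` list is illustrative.

```cuda
// Sketch: one possible MPI-rank-to-device mapping for an S1070 where the
// usable C1060s are not necessarily contiguous (0,2,3,4 on our box;
// 0,1,2,3 at the lab).
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int usable_devs[] = {0, 2, 3, 4};   // site-specific device list
    int dev = usable_devs[rank % 4];
    cudaSetDevice(dev);

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    printf("rank %d -> device %d: %zu of %zu bytes free\n",
           rank, dev, free_b, total_b);

    MPI_Finalize();
    return 0;
}
```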
Any suggestions as to what is occurring and how to fix it would be welcome, because this is very frustrating.