That actually shouldn't be an issue. If you manage to break your 16GB in half, you'll be able to
break it into 10, 20, … smaller pieces.
Let me describe what I do. I have a TCP master which listens and waits for job requests from the slaves.
Each slave (distinct host server) can use 1…8 GPU cards. This is how it can be 1…8:
1 GPU per Host: a single GTX280 or a single C1060
2 GPUs per Host: 2 GTX280, 2 C1060, 1 GTX295 (which is dual), or one half of a S1070
3 GPUs per Host: 3 GTX280, 3 C1060, or one GTX280 or C1060 plus one GTX295 (dual)
4 GPUs per Host: 1 S1070, 4 GTX280, 4 C1060, 2 GTX295
…
8 GPUs per Host: 2 S1070, or maybe 4 GTX295 (Colfax just released an 8 C1080 case)
When you run deviceQuery you see only the GPUs PHYSICALLY connected to your host machine.
Each GPU is independent (both within the context of the host, and certainly across the 4, 8, or however many
host nodes you have).
Let's get back to the master/slave paradigm. Each slave opens X CPU threads (pthreads on Linux, for example),
where X is the number of physical GPUs that slave/host has. It then requests that the master node send the
data needed for processing (for seismic, each GPU might for example process a different trace, a part of a trace, or
one velocity for one trace…). It then copies the data over PCIe to its dedicated GPU (by calling cudaSetDevice you attach
a different GPU to a different CPU thread - google for GPUWorker by Mr. Anderson here in the newsgroups and you'll
have a working implementation of this).
Then each slave/host thread launches a kernel to process its data - once the GPU is done, it copies the results back to the
CPU and sends them to the master. Then it gets a new job :)
How do you distribute your work to the CPU nodes now? It should be done in exactly the same way - just as CPU node 1
shouldn't communicate with CPU node 8 (in your current CPU-based cluster), GPU #1 shouldn't talk to GPU #8, and
the slave CPU nodes shouldn't communicate with each other either.
I guess the main concept is that each GPU should be atomic and independent in its work - it gets its portion of the work
that should be done, processes it, and sends the result back to the calling CPU thread…
feel free to ask anything else :)
eyal