I am trying to get the most out of a 16-core Xeon Dell R710 running 64-bit RHEL 5.5 with the 3.2 driver and toolkit, and multiple Tesla C2050 GPUs. I am using a separate POSIX thread per GPU context and doing the cudaSetDevice dance. When I run everything in a single process, I top out at around 2500 host thread algorithm iterations per second (aggregate), regardless of how many threads and/or GPUs the application uses. When I split my threads across two or more processes, I top out at around 5000 iterations per second (aggregate), regardless of how many processes, threads, and GPUs are in use. Each of my threads is independent and runs as fast as it can.
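For reference, here is roughly the pattern I mean by "the cudaSetDevice dance" (a minimal sketch, not my actual code; the names worker_args, worker, and dummy_kernel are placeholders, and the loop counts are arbitrary). Under the 3.2 toolkit each host thread gets its own context, so each thread calls cudaSetDevice once before doing any CUDA work:

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Placeholder per-thread state: which GPU to bind to and an iteration count.
typedef struct {
    int device;
    long iterations;
} worker_args;

// Stand-in for the real per-iteration kernel.
__global__ void dummy_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

static void *worker(void *p) {
    worker_args *a = (worker_args *)p;
    // With the 3.2 runtime, contexts are per host thread, so this must be
    // called in each thread before any other CUDA call.
    cudaSetDevice(a->device);

    float *d_buf;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    for (a->iterations = 0; a->iterations < 1000; ++a->iterations) {
        dummy_kernel<<<4, 256>>>(d_buf, 1024);
        cudaThreadSynchronize();  // 3.2-era name, pre-cudaDeviceSynchronize
    }
    cudaFree(d_buf);
    return NULL;
}

int main(void) {
    pthread_t threads[8];
    worker_args args[8];
    for (int i = 0; i < 8; ++i) {
        args[i].device = i / 4;  // 4 threads on GPU 0, 4 threads on GPU 1
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < 8; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```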
For example, if I run a single process with 4 host threads pointed at GPU 0 and 4 host threads pointed at GPU 1, I get around 2500 aggregate iterations through the host thread algorithm. If I instead split the work into two separate processes, one with 4 host threads pointed at GPU 0 and another with 4 host threads pointed at GPU 1, each process gets around 2500 aggregate iterations, for a total of 5000 across both processes. If I then run 3 applications, each with 4 threads pointed at GPUs 0, 1, and 2 respectively, I still end up with only around 5000 aggregate iterations across all three apps: each app gets around 1700.
So, it appears there is both a per-process limit and a total system limit.
Has anyone else run into this or have any ideas about how to get around it?
Thanks for any assistance,