CUDA host process throughput limits: better performance by moving host threads to separate applications

I am trying to get the most out of a 16-core Xeon Dell R710 (RHEL 5.5, 64-bit) running the 3.2 driver and toolkit with multiple Tesla C2050s. I am using separate POSIX threads per GPU and doing the cudaSetDevice dance. When I run in a single process space, I top out at around 2500 host-thread algorithm iterations per second (aggregate), regardless of how many threads and/or GPUs the application uses. When I split my threads across two or more applications, I top out at around 5000 iterations per second (aggregate) through the host-thread algorithm, regardless of how many processes, threads, and GPUs are in use. Each of my threads is independent and will run as fast as it can.

For example, if I run a single process with 4 host threads pointed at GPU 0 and 4 host threads pointed at GPU 1, I get around 2500 aggregate iterations through the host-thread algorithm. If I instead split that into two separate processes, one with 4 host threads pointed at GPU 0 and another with 4 host threads pointed at GPU 1, each process gets around 2500 aggregate iterations, for a total of 5000 aggregate host-thread algorithm iterations across both processes. If I then run 3 applications, each with 4 threads pointed at GPUs 0, 1, and 2 respectively, I still end up with only around 5000 aggregate iterations across all three apps: each app gets around 1700.

So, it appears that there is some kind of per-process limit and a separate system-wide limit.

Has anyone else run into this or have any ideas about how to get around it?

Thanks for any assistance,

The important missing baseline data point is what sort of throughput range you get for the single-thread, single-GPU case. I stress range because you are using a multi-GPU NUMA machine, and depending on how the CPU/host-memory/GPU affinity triumvirate resolves itself, there might be a lot of variation even in that simplest case.

There is a long list of potential issues that can affect scalability in multi-GPU codes. I have not had trouble tuning my multi-GPU apps to hit whatever the system bottleneck is (usually the PCI-e or QPI/HyperTransport bandwidth limit in my codes). Tuning to maximize throughput probably isn't the sort of thing that can be done via forum posts.

Thanks for the reply. I ran 11 tests processing 10 seconds of data in a single host thread on a single GPU, explicitly GPU 0. The results are as follows: max 1614 host iterations per second (hips), min 1522 hips, average 1570 hips, and standard deviation 27 hips. I would say those are fairly stable results.

I doubt it is PCI-e bandwidth that I am hitting, because I am able to double my throughput by moving half of my host threads to a different application process space; if it were a PCI-e limitation, I would expect that not to matter. I confess that I don't know much about QPI/HyperTransport, so I'll research it to see if that could be what is biting me.