I would like to process a large batch of data on 4 cards in parallel by chunking the data set into 4 separate chunks and having each card process a single chunk in its own process. However, performance does not scale as expected: two cards are only about 10% faster than one card, and three or four cards provide essentially no additional gain.
4 Pascal Titan X cards
Running the latest version of Caffe for deep learning
Since these are independent processes running on separate cards, I wouldn't expect the load on one card to have such an effect on the performance of the others. I've done something similar on an AWS g2.8xlarge instance and seen much better scaling from data parallelism across its 4 cards.
Anyone else tried to do this? What could be causing the problem? Drivers?
I've found that there's a bug affecting the Pascal Titan X that causes the power limit to be internally capped, even though the value displayed by nvidia-smi looks normal (250 W). If, under load, you see the clocks running slow with sw_power_cap active, reset the power limit:
sudo nvidia-smi -pl 250
This may only happen on certain motherboards.
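One way to check whether the power cap is what's slowing you down is to watch the clocks and throttle reasons while your job is running. These are standard nvidia-smi queries (run them on the machine with the cards; output will vary by driver version):

```shell
# Per-GPU SM clock, current power draw, and enforced power limit
nvidia-smi --query-gpu=index,clocks.sm,power.draw,power.limit --format=csv

# Active throttle reasons -- look for "SW Power Cap : Active"
# while the SM clock sits well below its rated boost
nvidia-smi -q -d PERFORMANCE
```

If sw_power_cap shows Active while power.draw is well under the 250 W limit, you're likely hitting the bug above and the `-pl 250` reset should help.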
Another thing to look at is whether your Caffe build is leveraging the NCCL library:
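For reference, NCCL support in BVLC Caffe is a compile-time option, so a binary built without it won't use NCCL no matter how you launch it. A sketch of enabling it and running multi-GPU training (paths and the solver file name are placeholders for your own setup; exact behavior depends on your Caffe version):

```shell
# In Makefile.config, uncomment the NCCL option before building:
#   USE_NCCL := 1
make clean && make -j8

# Train with data parallelism across all cards; depending on the
# Caffe version, multi-GPU either requires NCCL or falls back to a
# slower built-in P2P reduction without it
./build/tools/caffe train --solver=solver.prototxt -gpu all
```

Note that Caffe's own multi-GPU mode splits each batch across the cards in a single training run; if you instead launch 4 fully independent processes (one chunk per card, as described in the question), NCCL won't be involved, and the power-cap issue above is the more likely culprit.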