I would like to process a large batch of data on 4 cards in parallel by splitting the data set into 4 separate chunks and having each card process a single chunk in its own process. However, performance does not scale as expected: 2 cards are only about 10% faster than 1 card, and 3 or 4 cards provide essentially no further gain.
My setup:
Ubuntu 16.04
Driver 367.44
4 Pascal Titan X cards
CUDA v8.0.26
Running latest version of Caffe for deep learning
Since these are independent processes run on separate cards, I wouldn’t expect the load on one card to have such an effect on the performance of the other cards. I’ve done something similar using the AWS g2.8xlarge instance and seen much better performance exploiting data parallelism across 4 cards there.
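For reference, the launch pattern is one process per card, pinned with CUDA_VISIBLE_DEVICES so the processes can't interfere with each other's device selection (the worker script and chunk paths here are placeholders, not my actual setup):

```shell
# One independent worker per GPU; each worker sees its card as device 0.
# process_chunk.sh and the data/chunk_* paths are hypothetical placeholders.
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i ./process_chunk.sh data/chunk_$i &
done
wait   # block until all four workers finish
```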
Anyone else tried to do this? What could be causing the problem? Drivers?
I’ve found that there’s a bug with the Pascal Titan X that causes the power limit to be internally capped below its configured value. The value displayed by nvidia-smi looks normal (250 W). If under load you see the clocks running slow with sw_power_cap active, reset the power limit:
sudo nvidia-smi -pl 250
This may only happen on certain motherboards.
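To confirm whether this cap is what’s biting, you can query the throttle reasons directly while the cards are under load (a sketch; the exact output fields vary a bit across driver versions):

```shell
# Show current clocks, power draw, and which throttle reasons are active.
# Under the bug described above, "SW Power Cap" reads Active even though
# the reported power limit still shows 250 W.
nvidia-smi -q -d PERFORMANCE,CLOCK,POWER

# If SW Power Cap is active, re-apply the limit per card:
sudo nvidia-smi -i 0 -pl 250   # repeat with -i 1, 2, 3 for the other cards
```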
Another thing to look at is whether Caffe is built to leverage the NCCL library.
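If you’re on a Caffe version that has NCCL support, the flag is commented out by default in the standard Makefile build; a sketch of enabling it (this assumes libnccl is installed and that your Makefile.config carries the stock commented-out line):

```shell
# Enable NCCL in Makefile.config, then rebuild.
sed -i 's/^# USE_NCCL := 1/USE_NCCL := 1/' Makefile.config
make clean && make all -j"$(nproc)"
```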
How much data is transferred over PCIe? I have one computer (from a large, renowned manufacturer) that somehow only moves 1 GB/s over its PCIe 2.0 x16 bus.
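An easy way to measure this on a given box is the bandwidthTest sample that ships with the CUDA toolkit (the path below assumes a default CUDA 8.0 install location):

```shell
# Build and run NVIDIA's bandwidth sample to measure the actual
# host<->device throughput over PCIe for each card.
cp -r /usr/local/cuda-8.0/samples/1_Utilities/bandwidthTest /tmp/bwtest
cd /tmp/bwtest && make
./bandwidthTest --device=0 --memory=pinned   # repeat with --device=1..3
```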
Shouldn’t be much data to transfer. I’m sending images, so roughly 720 * 480 * 3 bytes ≈ 1 MB per image. My total process takes about 0.4 sec per image on one card, so the transfer time seems small, relatively.
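A quick sanity check on that claim, using the numbers above (the 1 GB/s figure is the worst-case bus rate mentioned earlier in the thread):

```shell
# Per-image transfer time even at a degraded 1 GB/s PCIe rate,
# versus the measured 0.4 s/image of compute.
python3 -c "
bytes_per_image = 720 * 480 * 3      # ~1.04 MB uncompressed
worst_rate = 1e9                     # 1 GB/s, the slow bus mentioned above
print('transfer per image: %.4f s' % (bytes_per_image / worst_rate))
print('compute  per image: 0.4000 s')
"
```

So even on a crippled bus, transfer is roughly 0.001 s against 0.4 s of compute, well under 1% overhead.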