I would like to process a large batch of data on 4 cards in parallel by splitting the data set into 4 separate chunks and having each card process a single chunk in its own process. However, performance does not scale as expected: 2 cards are only about 10% faster than 1 card, and 3 or 4 cards provide essentially no further gain.
My setup:
Ubuntu 16.04
Driver 367.44
4 Pascal Titan X cards
CUDA v8.0.26
Running latest version of Caffe for deep learning
Since these are independent processes run on separate cards, I wouldn’t expect the load on one card to have such an effect on the performance of the other cards. I’ve done something similar using the AWS g2.8xlarge instance and seen much better performance exploiting data parallelism across 4 cards there.
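For reference, the launch pattern is one process per card, pinned with CUDA_VISIBLE_DEVICES so the processes can't interfere with each other's device selection (the worker script and chunk paths here are placeholders, not my actual setup):

```shell
# One independent worker per GPU; each worker sees its card as device 0.
# process_chunk.sh and the data/chunk_* paths are hypothetical placeholders.
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i ./process_chunk.sh data/chunk_$i &
done
wait   # block until all four workers finish
```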
Anyone else tried to do this? What could be causing the problem? Drivers?
I’ve found that there’s a bug with the Pascal Titan X that causes the power limit to be internally capped below its configured value. The value displayed by nvidia-smi looks normal (250 W). If under load you see the clocks running slow with sw_power_cap active, reset the power limit:
sudo nvidia-smi -pl 250
This may only happen on certain motherboards.
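To confirm whether this cap is what’s biting, you can query the throttle reasons directly while the cards are under load (a sketch; the exact output fields vary a bit across driver versions):

```shell
# Show current clocks, power draw, and which throttle reasons are active.
# Under the bug described above, "SW Power Cap" reads Active even though
# the reported power limit still shows 250 W.
nvidia-smi -q -d PERFORMANCE,CLOCK,POWER

# If SW Power Cap is active, re-apply the limit per card:
sudo nvidia-smi -i 0 -pl 250   # repeat with -i 1, 2, 3 for the other cards
```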
Another thing to look at is whether Caffe is built to leverage the NCCL library.
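If you’re on a Caffe version that has NCCL support, the flag is commented out by default in the standard Makefile build; a sketch of enabling it (this assumes libnccl is installed and that your Makefile.config carries the stock commented-out line):

```shell
# Enable NCCL in Makefile.config, then rebuild.
sed -i 's/^# USE_NCCL := 1/USE_NCCL := 1/' Makefile.config
make clean && make all -j"$(nproc)"
```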
How much data is transferred over PCIe? I have one computer (from a large, renowned manufacturer) that somehow only moves 1 GB/s over its PCIe 2.0 x16 bus.
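An easy way to measure this on a given box is the bandwidthTest sample that ships with the CUDA toolkit (the path below assumes a default CUDA 8.0 install location):

```shell
# Build and run NVIDIA's bandwidth sample to measure the actual
# host<->device throughput over PCIe for each card.
cp -r /usr/local/cuda-8.0/samples/1_Utilities/bandwidthTest /tmp/bwtest
cd /tmp/bwtest && make
./bandwidthTest --device=0 --memory=pinned   # repeat with --device=1..3
```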
Shouldn’t be much data to transfer. I’m sending images, so roughly 720 * 480 * 3 bytes ≈ 1 MB per image. My total process takes about 0.4 sec per image on one card, so the transfer time seems small, relatively.
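A quick sanity check on that claim, using the numbers above (the 1 GB/s figure is the worst-case bus rate mentioned earlier in the thread):

```shell
# Per-image transfer time even at a degraded 1 GB/s PCIe rate,
# versus the measured 0.4 s/image of compute.
python3 -c "
bytes_per_image = 720 * 480 * 3      # ~1.04 MB uncompressed
worst_rate = 1e9                     # 1 GB/s, the slow bus mentioned above
print('transfer per image: %.4f s' % (bytes_per_image / worst_rate))
print('compute  per image: 0.4000 s')
"
```

So even on a crippled bus, transfer is roughly 0.001 s against 0.4 s of compute, well under 1% overhead.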