Titan V is slower than Titan Xp when batch size is small?

I have recently been benchmarking the new Titan V card against the Titan Xp. I expected the Titan V to be faster than the Titan Xp in most circumstances, but to my surprise, the Titan V is actually slower than the Xp when batch size == 1.

  • Driver Version: 384.111

  • CUDA Version: CUDA 9.0 + cuDNN 7.1.4

  • Framework: TensorFlow 1.10 and NVCaffe 0.17

  • Test Method: <tensorflow/benchmarks> and

  • Test Results on TensorFlow (units: ms/img, smaller is better)

    batch_size = 16    VGG16    Inception V3    ResNet50
    Titan V            6.0      7.0             4.7
    Titan Xp           7.5      8.3             5.5

    batch_size = 1     VGG16    Inception V3    ResNet50
    Titan V            26.3     43.5            32.3
    Titan Xp           30.3     38.5            28.6

  • Test Results on NVCaffe

    batch size = 16    VGG16    Inception V3    ResNet50
    Titan V            30.4     47.1            35.8
    Titan Xp           33.9     53              41.2

    batch size = 1     VGG16    Inception V3    ResNet50
    Titan V            6.53     24.8            14.1
    Titan Xp           5.45     22.6            12.1

  • As we can see, when batch size = 16, the Titan V is consistently faster than the Titan Xp, taking roughly 10-20% less time. However, when batch size = 1, the Titan Xp is faster than the Titan V on Inception V3 and ResNet50 in both frameworks, and on VGG16 as well under NVCaffe.

  • How could the Titan V perform worse than the Titan Xp? Is my benchmark method problematic, or is something else going on? I’ve googled most of the Titan V performance review articles, and they all report latency at large batch sizes, so I have no idea whether this is expected behavior.
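
  • For anyone who wants to check the comparison, here is a small Python sketch that computes the Titan V's time relative to the Titan Xp for every entry. The numbers are copied from the results above; the dictionary layout and the function name `titan_v_change_percent` are just my own choices for illustration, not part of either benchmark tool.

```python
# Timings copied from the result tables above (lower is better).
# Layout: (framework, batch size) -> {gpu: [VGG16, Inception V3, ResNet50]}
MODELS = ["VGG16", "Inception V3", "ResNet50"]
RESULTS = {
    ("TensorFlow", 16): {"Titan V": [6.0, 7.0, 4.7], "Titan Xp": [7.5, 8.3, 5.5]},
    ("TensorFlow", 1): {"Titan V": [26.3, 43.5, 32.3], "Titan Xp": [30.3, 38.5, 28.6]},
    ("NVCaffe", 16): {"Titan V": [30.4, 47.1, 35.8], "Titan Xp": [33.9, 53.0, 41.2]},
    ("NVCaffe", 1): {"Titan V": [6.53, 24.8, 14.1], "Titan Xp": [5.45, 22.6, 12.1]},
}


def titan_v_change_percent(v_time, xp_time):
    """Percent change of the Titan V time relative to the Titan Xp time.

    Negative means the Titan V took less time (faster),
    positive means it took more time (slower).
    """
    return (v_time - xp_time) / xp_time * 100.0


for (framework, batch), timings in RESULTS.items():
    for model, v, xp in zip(MODELS, timings["Titan V"], timings["Titan Xp"]):
        change = titan_v_change_percent(v, xp)
        print(f"{framework:10s} batch={batch:<2d} {model:12s} {change:+6.1f}%")
```

    A negative percentage means the Titan V needed less time than the Titan Xp for that model; a positive one means it needed more.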