I have a Titan V and a Titan Xp installed in the same computer. I ran the same program, network, and images on each GPU separately. The Xp took about 15 minutes per epoch, while the V took 18 minutes. I checked the motherboard specs, and each card is in a PCIe x16 slot. When I ran nvidia-smi, the V reported 99% GPU utilization and the Xp reported 100%. The difference that stood out to me was that the wattage and temperature were significantly higher on the Xp. I’m running Ubuntu… What else can I look at to figure out why the V is slower?
If I had to guess, I’d say it’s because the Xp has 30 SMs compared to the Titan V’s 80 SMs. Sometimes your dimensions are such that it’s hard to tile the work across that many SMs. Increasing the batch size might help here.
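To make the tiling point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that a kernel launches a fixed number of equally sized thread blocks and that each SM runs exactly one block at a time; real kernels and the real scheduler are more complicated, and the block counts below are made up. Only the SM counts (30 and 80) come from the post above.

```python
import math

def wave_utilization(num_blocks: int, num_sms: int) -> float:
    """Fraction of SM slots doing useful work, under the simplifying
    assumption that each SM runs exactly one block at a time."""
    waves = math.ceil(num_blocks / num_sms)  # scheduling rounds needed
    return num_blocks / (waves * num_sms)

# Hypothetical block counts; 30 and 80 are the SM counts quoted above.
for blocks in (60, 90, 240):
    xp = wave_utilization(blocks, 30)
    v = wave_utilization(blocks, 80)
    print(f"{blocks:4d} blocks: Xp {xp:.0%} busy, Titan V {v:.0%} busy")
```

Under these assumptions a small amount of work fills 30 SMs evenly but leaves part of an 80-SM GPU idle in the last round, and the gap shrinks as the amount of work grows, which is the intuition behind trying a larger batch size.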
Make sure your code is compiled for sm_70 when running on the Titan V, and make sure you are using the latest CUDA.
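If the code runs through PyTorch rather than a hand-compiled binary, one way to check this is to compare each GPU's compute capability against the architectures the installed PyTorch build targets. This is only a sketch: `arch_is_covered` is a hypothetical helper, and a missing `sm_XY` entry does not necessarily mean failure, since a build may also ship PTX (`compute_XY`) that gets JIT-compiled at load time.

```python
def arch_is_covered(capability, arch_list):
    """Return True if a (major, minor) compute capability has a matching
    native 'sm_XY' entry in the list of architectures a build targets."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # PyTorch not installed; nothing to inspect

    if torch is not None and torch.cuda.is_available():
        print("CUDA version PyTorch was built with:", torch.version.cuda)
        archs = torch.cuda.get_arch_list()
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            cap = torch.cuda.get_device_capability(i)
            print(i, name, cap, "native kernels:", arch_is_covered(cap, archs))
```

The Titan V should report capability (7, 0) and the Titan Xp (6, 1).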
Maybe late to the discussion, but I’ve had a similar experience of the TITAN V being slower than the TITAN Xp for training. Apparently NVIDIA lowered the FP32 CUDA clock speed on the TITAN V. https://devtalk.nvidia.com/default/topic/1036962/titan-v-max-clock-speed-locked-to-1-335-mhz-and-underperforms-titan-xp-ubuntu-16-04-nvidia-390-amp-396-/?offset=3. EDIT: NVIDIA has now unlocked the TITAN V CUDA clock speed, starting with driver 415.25 https://devtalk.nvidia.com/default/topic/1042047/container-tensorflow/titan-v-slower-than-1080ti-tensorflow-18-08-py3-and-396-54-drivers/post/5305096/#5305096
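A quick way to see whether this applies to a given machine is to ask nvidia-smi for the driver version and the current and maximum SM clocks while the job is running. The sketch below wraps that query in Python; the helper names are hypothetical, and it assumes 415.25 refers to the driver version, as the linked threads suggest.

```python
import subprocess

QUERY = "name,driver_version,clocks.sm,clocks.max.sm"

def parse_smi_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU.
    Clock fields are kept as strings because they may be '[N/A]'."""
    rows = []
    for line in text.strip().splitlines():
        name, driver, sm, max_sm = [f.strip() for f in line.split(",")]
        rows.append({"name": name, "driver": driver,
                     "sm_mhz": sm, "max_sm_mhz": max_sm})
    return rows

def driver_at_least(version, minimum=(415, 25)):
    """Compare the first two components of a driver version string."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return parts >= minimum

def query_gpus():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)

if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu, "driver >= 415.25:", driver_at_least(gpu["driver"]))
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine")
```

If the SM clock stays well below the reported maximum under load on an older driver, the clock cap described above is a plausible culprit.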
I am finally getting back to this because I want to use both GPUs.
I ran the exact same program on the exact same images. I just ran it twice, once with os.environ['CUDA_VISIBLE_DEVICES'] = '0' and again with os.environ['CUDA_VISIBLE_DEVICES'] = '1'.
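One thing worth ruling out with this kind of setup: by default CUDA orders devices "fastest first", which need not match the indices nvidia-smi shows (those follow PCI bus order), and the variable only takes effect if it is set before CUDA is initialized. A minimal sketch, where `pin_gpu` is a hypothetical helper:

```python
import os

def pin_gpu(index: str) -> None:
    """Restrict this process to one GPU. Must run before CUDA is
    initialized (safest: before importing torch)."""
    # Make CUDA's device numbering match nvidia-smi's PCI bus order.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = index

pin_gpu("0")

try:
    import torch
    if torch.cuda.is_available():
        # Inside the process, the single visible GPU is always device 0.
        print("Running on:", torch.cuda.get_device_name(0))
except ImportError:
    pass  # PyTorch not installed; the environment variables are still set
```

Printing the device name in each run confirms which card a given index actually selected.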
The Titan V's reported memory usage is much higher than the Xp's (0.8 GB on the Xp vs 5.5 GB on the V), and it is about 6x slower (11 ms on the Xp vs 68 ms on the V). I spawned two processes on each GPU so that I was running 4 inferences at once, and this behavior was repeatable: the Xp was at about 1.6 GB and the V at about 11 GB.
I don’t remember seeing this memory usage behavior during training, but I also didn’t know it might be important.
The CNN that I am using is PyTorch-based… the image is (512, 512).
I also ran this on a 1050 in a different computer, and it takes somewhere around 32 ms; the memory usage there is under 1 GB as well.
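When comparing per-inference timings and memory like these, it may help to measure both from inside PyTorch. CUDA kernels launch asynchronously, so a timer that does not synchronize (or that includes the first few warm-up calls) can be misleading, and the memory figure in nvidia-smi includes the CUDA context and PyTorch's cached blocks, not just live tensors. The sketch below is an assumption-laden illustration: `time_call` is a hypothetical helper, and the one-layer model and 3-channel input are stand-ins for the real CNN and image.

```python
import time

def time_call(fn, sync=lambda: None, warmup=5, iters=20):
    """Average seconds per call of fn(). sync() is called before each
    clock read so that queued asynchronous work is included."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # PyTorch not installed; skip the GPU demo

    if torch is not None and torch.cuda.is_available():
        # Stand-in model and input (assumed 3 channels); use the real ones.
        model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().eval()
        x = torch.randn(1, 3, 512, 512, device="cuda")

        def infer():
            with torch.no_grad():
                model(x)

        secs = time_call(infer, sync=torch.cuda.synchronize)
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        print(f"{secs * 1e3:.2f} ms per inference")
        print(f"peak tensor memory: {peak_mib:.0f} MiB")
```

Running the same script once per GPU gives numbers that are directly comparable, and the peak tensor memory shows how much of the nvidia-smi figure comes from the model itself rather than from context and cache overhead.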