I implemented an iterative algorithm using CUDA 8.0 and three Titan X cards on Ubuntu 16.04. Performance was good: 100 seconds, compared with 300 seconds on a 40-core/80-thread CPU.
After testing, I deployed the app onto a DGX-1 GPU server (nvidia-docker 2, one Tesla V100 card, driver version 410, CUDA version 10.0) in a GPU container with CUDA 8.0 (no driver). Unfortunately, the performance there is much worse than with my original Titan X version: 500 seconds.
Could you please give me some hints on what might be wrong with my experiment?