Computational speed of high-end and mid-range Nvidia Tesla cards

Dear Community,
this is Max from Italy.
As part of my University research activities, I am working on software for RF propagation simulations. The code, developed by another research group at my University, makes extensive use of GPU computing (with the cards running in TCC driver mode) and CUDA for parallelization.

For my activities, I have two workstations with nearly identical specs in terms of RAM and Xeon CPU. The only difference is that one has a Tesla P100 installed and the other a Tesla K40. Generally speaking, I would expect the P100 to outperform the K40 in terms of computational speed. In my experience this holds as long as the simulation environment is “demanding”, i.e. a challenging urban scenario with many propagation rays to be calculated and managed. On the contrary, if the scenario is “simple”, i.e. few buildings and not so many rays, the P100 is not as efficient as the K40, which runs the simulation faster. I am not an expert in CUDA programming or in the hardware specs of these two Tesla cards, but can any of you provide a reasonable explanation of this behaviour? How come a P100 is slower than a K40 for small, simple scenarios? Is there some kind of “overhead” to be aware of?

Do you run the same binaries on both cards? Have the binaries been built specifically to support the respective architectures (Pascal and Kepler)?

If you build this software from source, make sure you’re targeting the right architectures with the nvcc flags (-gencode and -arch). Also try out different CUDA toolkit versions, e.g. 7.5, 8.0 and 9.0, to see which one runs fastest on each architecture; note that the Pascal target requires CUDA 8.0 or later.
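For example, a build line covering both cards could look roughly like this (just a sketch; the source file and binary names are placeholders, sm_35 is the K40/Kepler target and sm_60 the P100/Pascal target):

    nvcc -O3 \
         -gencode arch=compute_35,code=sm_35 \
         -gencode arch=compute_60,code=sm_60 \
         -o rf_sim main.cu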

Maybe the software itself isn’t tuned for simpler scenarios. If the scenario is very basic, it might only partially utilize the available GPU resources: it could end up launching a low number of threads and blocks, which can run much less efficiently on one card than on another (depending on how many multiprocessors the respective card offers in total; the P100 has 56 SMs versus 15 on the K40, so a small launch leaves a larger fraction of the P100 idle).
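As a quick sanity check you could compare the number of blocks a “simple” scenario launches against the multiprocessor count reported by the runtime. A minimal sketch (not taken from your simulator; the grid size is a made-up number):

    // Sketch: compare a hypothetical launch size against the number of SMs.
    // The K40 reports 15 multiprocessors, the P100 reports 56, so the same
    // small grid covers a much smaller fraction of the P100.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksLaunched = 30;  // hypothetical grid size for a simple scenario
        printf("GPU: %s, multiprocessors: %d\n", prop.name, prop.multiProcessorCount);
        printf("blocks launched: %d (%.1f per SM)\n",
               blocksLaunched, (double)blocksLaunched / prop.multiProcessorCount);
        return 0;
    }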

Different problem sizes may also interact differently with the L1 and L2 caches of each GPU architecture. The basic scenario (e.g. the geometric building descriptions) might just fit into the cache on one card, but not so well on the other.
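If you want to check that, the runtime can tell you the L2 size of each card. Another small sketch (the scene size below is a placeholder, not your actual data):

    // Sketch: does a hypothetical working set fit into the card's L2 cache?
    // The K40 has 1.5 MB of L2, the P100 has 4 MB.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        size_t sceneBytes = 2u * 1024 * 1024;  // placeholder size of the building geometry
        printf("L2 cache: %d bytes, scene data: %zu bytes -> %s\n",
               prop.l2CacheSize, sceneBytes,
               (size_t)prop.l2CacheSize >= sceneBytes ? "fits" : "does not fit");
        return 0;
    }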

Overall it’s hard to pinpoint efficiency problems without running a profiling session. It’s pure guesswork before one has seen the NVIDIA CUDA profiler output.
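For a first look, nvprof (the command-line front end to the Visual Profiler, shipped with those CUDA versions) already tells you a lot; something along these lines, with ./rf_sim standing in for the actual simulator binary:

    nvprof --metrics achieved_occupancy,sm_efficiency,gld_efficiency ./rf_sim

Comparing those metrics for the “simple” scenario on the K40 and on the P100 should show whether it is an occupancy/utilization problem or something else entirely.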